Hi, I tried your code. Very good work so far! Congratulations.

Is the examples/result file corrupted? It has only one line.

Do you plan to implement a simple CLI so it can be used interactively from the command line, similar to bin/opennlp Doccat and bin/opennlp TokenNameFinder?

Also, do you plan to add evaluation tools by extending AbstractEvaluatorTool and AbstractCrossValidatorTool, as well as the listener EvaluationErrorPrinter? I find these tools very useful while developing new models and features; maybe you would find them useful as well.

You could also look at DoccatFineGrainedReportListener as a starting point for creating a confusion matrix (I think it would be easy because the Doccat data structures are similar to yours). The result would look like the following (this is a 300-entry Portuguese corpus I am building from Facebook messages):

=== Evaluation summary ===
  Number of documents:      298
  Min sentence size:          1
  Max sentence size:        463
  Average sentence size:  18,01
  Categories count:           4
  Accuracy:              61,41%

=== Detailed Accuracy By Tag ===
----------------------------------------------------------------------
|      Tag | Errors | Count | % Err | Precision | Recall | F-Measure |
----------------------------------------------------------------------
|  neutral |     46 |    56 | 0,821 |     0,588 |  0,179 |     0,274 |
| positive |     46 |    70 | 0,657 |      0,48 |  0,343 |       0,4 |
| negative |     18 |   167 | 0,108 |     0,651 |  0,892 |     0,753 |
|     spam |      5 |     5 |     1 |         0 |      0 |         0 |
----------------------------------------------------------------------

=== Confusion matrix ===
    a     b     c     d | Accuracy | <-- classified as
<149>    13     4     1 |   89,22% |  a = negative
   42  <24>     3     1 |   34,29% |  b = positive
   35    11  <10>     . |   17,86% |  c = neutral
    3     2     .   <.> |       0% |  d = spam

Regards,
William

2016-06-23 2:11 GMT-03:00 Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov>:

> Thank you Jason!
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> On 6/22/16, 8:41 PM, "Jason Baldridge" <jasonbaldri...@gmail.com> wrote:
>
> >Anastasija,
> >
> >There might be a few appropriate sentiment datasets listed in my homework
> >on Twitter sentiment analysis:
> >
> >https://github.com/utcompling/applied-nlp/wiki/Homework5
> >
> >There may also be some useful data sets in the Crowdflower Open Data
> >collection:
> >
> >https://www.crowdflower.com/data-for-everyone/
> >
> >Hope this helps!
> >
> >-Jason
> >
> >On Wed, 22 Jun 2016 at 15:59 Anastasija Mensikova <
> >mensikova.anastas...@gmail.com> wrote:
> >
> >> Hi everyone,
> >>
> >> Some updates on our Sentiment Analysis Parser work.
> >>
> >> You might have noticed, I have enhanced our website (the GH page)
> >> recently, polished it and made it more user-friendly. My next step will
> >> be sending a pull request to Tika. However, my main goal until the end
> >> of Google Summer of Code is to enhance the parser in a way that will
> >> allow it to work categorically (in other words, the sentiment determined
> >> won't be just positive or negative, it will have a few categories). This
> >> means that my next step is to look for a categorical open data set
> >> (which I will hopefully do by the end of the weekend the latest) and, of
> >> course, enhance my model and training. After that I will look into how
> >> the confidence levels can be increased.
> >>
> >> Have a great day/night!
> >>
> >> Thank you,
> >> Anastasija Mensikova.
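
For anyone who wants to sanity-check a report like the one in William's message, the per-tag precision, recall and F-measure can be recomputed directly from the confusion matrix. The sketch below is plain Python and does not use OpenNLP; the names `per_tag_metrics` and `accuracy` are made up for illustration, and it assumes that each matrix row is the reference tag and each column is the tag the classifier assigned (which is how the row totals match the "Count" column).

```python
# Confusion matrix copied from the report; "." entries are written as 0.
# Assumption: rows = reference tag, columns = tag assigned by the classifier.
labels = ["negative", "positive", "neutral", "spam"]
matrix = [
    [149, 13, 4, 1],   # reference: negative
    [42, 24, 3, 1],    # reference: positive
    [35, 11, 10, 0],   # reference: neutral
    [3, 2, 0, 0],      # reference: spam
]


def per_tag_metrics(labels, matrix):
    """Return {tag: (precision, recall, f_measure)} for a square matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        actual = sum(matrix[i])                    # row total: documents with this tag
        predicted = sum(row[i] for row in matrix)  # column total: documents given this tag
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f_measure)
    return metrics


def accuracy(matrix):
    """Fraction of documents on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


metrics = per_tag_metrics(labels, matrix)
for label, (p, r, f) in metrics.items():
    print(f"{label:<9} precision={p:.3f} recall={r:.3f} f-measure={f:.3f}")
print(f"accuracy={accuracy(matrix):.4f}")
```

Rounded to the same number of digits, these values should agree with the "Detailed Accuracy By Tag" table and the overall accuracy line of the report.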