How about we keep track of the sets used for performance evaluation and results in this doc for now:

https://docs.google.com/spreadsheets/d/15c0-u61HNWfQxiDSGjk49M1uBknIfb-LkbP4BDWTB5w/edit?usp=sharing

Will try to take a better look at OntoNotes and what to use from it. Otherwise, if anyone would like to suggest proper datasets for testing each component, that would be really helpful.

Anthony

On Thu, Jun 23, 2016 at 12:18 AM, Joern Kottmann <kottm...@gmail.com> wrote:
> It would be nice to get MASC support into the OpenNLP formats package.
>
> Jörn
>
> On Tue, Jun 21, 2016 at 6:18 PM, Jason Baldridge <jasonbaldri...@gmail.com> wrote:
>> Jörn is absolutely right about that. Another good source of training data
>> is MASC. I've got some instructions for training models with MASC here:
>>
>> https://github.com/scalanlp/chalk/wiki/Chalk-command-line-tutorial
>>
>> Chalk (now defunct) provided a Scala wrapper around OpenNLP functionality,
>> so the instructions there should make it fairly straightforward to adapt
>> MASC data to OpenNLP.
>>
>> -Jason
>>
>> On Tue, 21 Jun 2016 at 10:46 Joern Kottmann <kottm...@gmail.com> wrote:
>>> There are some research papers which study and compare the performance of
>>> NLP toolkits, but be careful: often they don't train the NLP tools on the
>>> same data, and the training data makes a big difference to the
>>> performance.
>>>
>>> Jörn
>>>
>>> On Tue, Jun 21, 2016 at 5:44 PM, Joern Kottmann <kottm...@gmail.com> wrote:
>>>> Just don't use the very old existing models; to get good results you
>>>> have to train on your own data, especially if the domain of the training
>>>> data and the data which should be processed doesn't match. The old
>>>> models are trained on 90s news; those don't work well on today's news
>>>> and probably much worse on tweets.
>>>>
>>>> OntoNotes is a good place to start if the goal is to process news.
>>>> OpenNLP comes with built-in support to train models from OntoNotes.
>>>> Jörn
>>>>
>>>> On Tue, Jun 21, 2016 at 4:20 PM, Mattmann, Chris A (3980)
>>>> <chris.a.mattm...@jpl.nasa.gov> wrote:
>>>>> This sounds like a fantastic idea.
>>>>>
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> Chris Mattmann, Ph.D.
>>>>> Chief Architect
>>>>> Instrument Software and Science Data Systems Section (398)
>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>> Office: 168-519, Mailstop: 168-527
>>>>> Email: chris.a.mattm...@nasa.gov
>>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> Director, Information Retrieval and Data Science Group (IRDS)
>>>>> Adjunct Associate Professor, Computer Science Department
>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>> WWW: http://irds.usc.edu/
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>
>>>>> On 6/21/16, 12:13 AM, "Anthony Beylerian" <anthonybeyler...@hotmail.com>
>>>>> wrote:
>>>>>> +1
>>>>>>
>>>>>> Maybe we could put the results of the evaluator tests for each
>>>>>> component somewhere on a webpage and update them on every release.
>>>>>> This is of course provided there are reasonable datasets for testing
>>>>>> each component. What do you think?
>>>>>>
>>>>>> Anthony
>>>>>>
>>>>>>> From: mondher.bouaz...@gmail.com
>>>>>>> Date: Tue, 21 Jun 2016 15:59:47 +0900
>>>>>>> Subject: Re: Performances of OpenNLP tools
>>>>>>> To: dev@opennlp.apache.org
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Thank you for your replies.
>>>>>>>
>>>>>>> Jeffrey, please accept my apologies once more for the duplicate email.
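The per-release evaluator-results webpage suggested above could be generated by a very small script. A minimal sketch, assuming hypothetical component names and placeholder scores (these are not real OpenNLP results):

```python
# Render per-component evaluation scores into a Markdown table suitable
# for a project webpage. Component names and score values below are
# placeholders for illustration, not actual OpenNLP measurements.

def to_markdown_table(scores):
    """scores: list of (component, metric, value) tuples."""
    lines = ["| Component | Metric | Score |", "|---|---|---|"]
    for component, metric, value in scores:
        lines.append(f"| {component} | {metric} | {value:.4f} |")
    return "\n".join(lines)

example = [
    ("Tokenizer", "F1", 0.9912),        # placeholder value
    ("POS Tagger", "Accuracy", 0.9631),  # placeholder value
    ("Lemmatizer", "Accuracy", 0.9458),  # placeholder value
]
print(to_markdown_table(example))
```

Re-running such a script against each release's evaluator output and committing the table would keep the page up to date with little effort.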
>>>>>>> I also think it would be great to have such studies on the
>>>>>>> performance of OpenNLP.
>>>>>>>
>>>>>>> I have been looking for this information and checked in many places,
>>>>>>> including obviously Google Scholar, and I haven't found any serious
>>>>>>> studies or reliable results. Most of the existing ones report the
>>>>>>> performance of outdated releases of OpenNLP, and focus more on
>>>>>>> execution time or CPU/RAM consumption, etc.
>>>>>>>
>>>>>>> I think such a comparison will help not only to evaluate the overall
>>>>>>> accuracy, but also to highlight the issues with the existing models
>>>>>>> (as a matter of fact, the existing models fail to recognize many of
>>>>>>> the hashtags in tweets: the tokenizer splits them into the "#" symbol
>>>>>>> and a word that the PoS tagger also fails to recognize).
>>>>>>>
>>>>>>> Therefore, building Twitter-based models would also be useful, since
>>>>>>> many of the works in academia and industry are focusing on Twitter
>>>>>>> data.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Mondher
>>>>>>>
>>>>>>> On Tue, Jun 21, 2016 at 12:45 AM, Jason Baldridge
>>>>>>> <jasonbaldri...@gmail.com> wrote:
>>>>>>>> It would be fantastic to have these numbers. This is an example of
>>>>>>>> something that would be a great contribution by someone trying to
>>>>>>>> contribute to open source and who is maybe just getting into machine
>>>>>>>> learning and natural language processing.
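The hashtag failure described above (a tokenizer splitting "#word" into "#" plus "word") is easy to reproduce, and also easy to avoid with a Twitter-aware token pattern. A minimal illustrative sketch in plain regex terms, not OpenNLP code:

```python
import re

# A naive word/punctuation tokenizer splits "#opennlp" into "#" + "opennlp",
# while a Twitter-aware pattern keeps hashtags and @-mentions whole.
# Both patterns here are illustrative, not what OpenNLP's tokenizer does
# internally.
NAIVE = re.compile(r"\w+|[^\w\s]")
TWEET_AWARE = re.compile(r"[#@]\w+|\w+|[^\w\s]")

def tokenize(pattern, text):
    return pattern.findall(text)

tweet = "Trying #opennlp on tweets @user!"
print(tokenize(NAIVE, tweet))        # hashtag split: ..., '#', 'opennlp', ...
print(tokenize(TWEET_AWARE, tweet))  # hashtag kept:  ..., '#opennlp', ...
```

The same idea, expressed as training data conventions rather than a regex, is what a Twitter-specific tokenizer model would learn.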
>>>>>>>> For Twitter-ish text, it'd be great to look at models trained and
>>>>>>>> evaluated on the Tweet NLP resources:
>>>>>>>>
>>>>>>>> http://www.cs.cmu.edu/~ark/TweetNLP/
>>>>>>>>
>>>>>>>> And comparing to how their models performed, etc. Also, it's worth
>>>>>>>> looking at spaCy (Python NLP modules) for further comparisons.
>>>>>>>>
>>>>>>>> https://spacy.io/
>>>>>>>>
>>>>>>>> -Jason
>>>>>>>>
>>>>>>>> On Mon, 20 Jun 2016 at 10:41 Jeffrey Zemerick <jzemer...@apache.org>
>>>>>>>> wrote:
>>>>>>>>> I saw the same question on the users list on June 17. At least I
>>>>>>>>> thought it was the same question -- sorry if it wasn't.
>>>>>>>>>
>>>>>>>>> On Mon, Jun 20, 2016 at 11:37 AM, Mattmann, Chris A (3980)
>>>>>>>>> <chris.a.mattm...@jpl.nasa.gov> wrote:
>>>>>>>>>> Well, hold on. He sent that mail (as of the time of this mail) 4
>>>>>>>>>> mins previously. Maybe some folks need some time to reply ^_^
>>>>>>>>>>
>>>>>>>>>> Chris Mattmann, Ph.D.
>>>>>>>>>>
>>>>>>>>>> On 6/20/16, 8:23 AM, "Jeffrey Zemerick" <jzemer...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>> Hi Mondher,
>>>>>>>>>>>
>>>>>>>>>>> Since you didn't get any replies, I'm guessing no one is aware of
>>>>>>>>>>> any resources related to what you need. Google Scholar is a good
>>>>>>>>>>> place to look for papers referencing OpenNLP and its methods (in
>>>>>>>>>>> case you haven't searched it already).
>>>>>>>>>>>
>>>>>>>>>>> Jeff
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jun 20, 2016 at 11:19 AM, Mondher Bouazizi
>>>>>>>>>>> <mondher.bouaz...@gmail.com> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Apologies if you received multiple copies of this email. I sent
>>>>>>>>>>>> it to the users list a while ago and haven't had an answer yet.
>>>>>>>>>>>> I have been looking for a while to see whether there is any
>>>>>>>>>>>> relevant work that tested the OpenNLP tools (in particular the
>>>>>>>>>>>> Lemmatizer, Tokenizer and PoS Tagger) on short and noisy texts
>>>>>>>>>>>> such as Twitter data, and/or compared them to other libraries.
>>>>>>>>>>>>
>>>>>>>>>>>> By performance, I mean accuracy/precision rather than execution
>>>>>>>>>>>> time, etc.
>>>>>>>>>>>>
>>>>>>>>>>>> If anyone can refer me to a paper or work done in this context,
>>>>>>>>>>>> that would be of great help.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you very much.
>>>>>>>>>>>>
>>>>>>>>>>>> Mondher
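For reference, the accuracy/precision numbers asked for above are straightforward to compute once gold and predicted tag sequences are in hand. A minimal sketch of token-level tagging accuracy and per-tag precision/recall, using invented tag sequences for illustration (not output from any real tagger):

```python
# Token-level accuracy and per-tag precision/recall for a POS tagger,
# computed from aligned gold and predicted tag sequences.
# The sequences below are made up for illustration.

def accuracy(gold, pred):
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)

def precision_recall(gold, pred, tag):
    tp = sum(g == p == tag for g, p in zip(gold, pred))
    predicted = sum(p == tag for p in pred)
    actual = sum(g == tag for g in gold)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

gold = ["NN", "VB", "NN", "DT", "NN"]
pred = ["NN", "VB", "JJ", "DT", "NN"]

print(accuracy(gold, pred))                # 0.8 (4 of 5 tokens correct)
print(precision_recall(gold, pred, "NN"))  # precision 1.0, recall 2/3
```

OpenNLP's own evaluator tools report metrics of this kind; a standalone script like this is mainly useful for cross-toolkit comparisons on a shared test set.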