How about we keep track of the sets used for performance evaluation and
results in this doc for now:

https://docs.google.com/spreadsheets/d/15c0-u61HNWfQxiDSGjk49M1uBknIfb-LkbP4BDWTB5w/edit?usp=sharing

I will try to take a closer look at OntoNotes and decide what to use from it.
Otherwise, if anyone would like to suggest suitable data sets for testing
each component, that would be really helpful.
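
In case it helps when filling in the sheet, here is a minimal sketch of how
a tokenizer could be scored against a gold tokenization by comparing
character spans. It is only an illustration under my own assumptions: the
helper names (token_spans, prf) and the sample sentence are made up, it does
not call OpenNLP, and the "predicted" tokens simply imitate the hashtag
split described further down in this thread.

```python
def token_spans(text, tokens):
    """Map each token to its (start, end) character offsets in text."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)  # search from the end of the previous token
        end = start + len(tok)
        spans.append((start, end))
        pos = end
    return spans


def prf(gold, predicted):
    """Precision, recall and F1 over exactly matching spans."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)  # spans that appear in both tokenizations
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the gold standard keeps the hashtag as one token,
# while the predicted tokenization splits it into "#" and a word.
text = "Loving #OpenNLP today"
gold = token_spans(text, ["Loving", "#OpenNLP", "today"])
pred = token_spans(text, ["Loving", "#", "OpenNLP", "today"])

precision, recall, f1 = prf(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Matching on spans rather than on token strings keeps repeated words from
being counted as correct in the wrong position. The same three numbers could
be recorded per component and per data set in the spreadsheet.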

Anthony

On Thu, Jun 23, 2016 at 12:18 AM, Joern Kottmann <kottm...@gmail.com> wrote:

> It would be nice to get MASC support into the OpenNLP formats package.
>
> Jörn
>
> On Tue, Jun 21, 2016 at 6:18 PM, Jason Baldridge <jasonbaldri...@gmail.com
> >
> wrote:
>
> > Jörn is absolutely right about that. Another good source of training data
> > is MASC. I've got some instructions for training models with MASC here:
> >
> > https://github.com/scalanlp/chalk/wiki/Chalk-command-line-tutorial
> >
> > Chalk (now defunct) provided a Scala wrapper around OpenNLP
> functionality,
> > so the instructions there should make it fairly straightforward to adapt
> > MASC data to OpenNLP.
> >
> > -Jason
> >
> > On Tue, 21 Jun 2016 at 10:46 Joern Kottmann <kottm...@gmail.com> wrote:
> >
> > > There are some research papers which study and compare the performance
> of
> > > > NLP toolkits, but be careful: often they don't train the NLP tools on
> the
> > > same data and the training data makes a big difference on the
> > performance.
> > >
> > > Jörn
> > >
> > > On Tue, Jun 21, 2016 at 5:44 PM, Joern Kottmann <kottm...@gmail.com>
> > > wrote:
> > >
> > > > > Just don't use the very old existing models; to get good results you
> > have
> > > > to train on your own data, especially if the domain of the data used
> > for
> > > > training and the data which should be processed doesn't match. The
> old
> > > > > models are trained on 90s news, those don't work well on today's news
> > and
> > > > probably much worse on tweets.
> > > >
> > > > > OntoNotes is a good place to start if the goal is to process news.
> > OpenNLP
> > > > > comes with built-in support to train models from OntoNotes.
> > > >
> > > > Jörn
> > > >
> > > > On Tue, Jun 21, 2016 at 4:20 PM, Mattmann, Chris A (3980) <
> > > > chris.a.mattm...@jpl.nasa.gov> wrote:
> > > >
> > > >> This sounds like a fantastic idea.
> > > >>
> > > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >> Chris Mattmann, Ph.D.
> > > >> Chief Architect
> > > >> Instrument Software and Science Data Systems Section (398)
> > > >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > >> Office: 168-519, Mailstop: 168-527
> > > >> Email: chris.a.mattm...@nasa.gov
> > > >> WWW:  http://sunset.usc.edu/~mattmann/
> > > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >> Director, Information Retrieval and Data Science Group (IRDS)
> > > >> Adjunct Associate Professor, Computer Science Department
> > > >> University of Southern California, Los Angeles, CA 90089 USA
> > > >> WWW: http://irds.usc.edu/
> > > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On 6/21/16, 12:13 AM, "Anthony Beylerian" <
> > anthonybeyler...@hotmail.com
> > > >
> > > >> wrote:
> > > >>
> > > >> >+1
> > > >> >
> > > >> >Maybe we could put the results of the evaluator tests for each
> > > component
> > > >> somewhere on a webpage and on every release update them.
> > > >> >This is of course provided there are reasonable data sets for
> testing
> > > >> each component.
> > > >> >What do you think?
> > > >> >
> > > >> >Anthony
> > > >> >
> > > >> >> From: mondher.bouaz...@gmail.com
> > > >> >> Date: Tue, 21 Jun 2016 15:59:47 +0900
> > > >> >> Subject: Re: Performances of OpenNLP tools
> > > >> >> To: dev@opennlp.apache.org
> > > >> >>
> > > >> >> Hi,
> > > >> >>
> > > >> >> Thank you for your replies.
> > > >> >>
> > > >> >> Please Jeffrey accept once more my apologies for receiving the
> > email
> > > >> twice.
> > > >> >>
> > > >> >> I also think it would be great to have such studies on the
> > > >> performances of
> > > >> >> OpenNLP.
> > > >> >>
> > > >> >> I have been looking for this information and checked in many
> > places,
> > > >> >> including obviously google scholar, and I haven't found any
> serious
> > > >> studies
> > > >> >> or reliable results. Most of the existing ones report the
> > > performances
> > > >> of
> > > >> >> outdated releases of OpenNLP, and focus more on the execution
> time
> > or
> > > >> >> CPU/RAM consumption, etc.
> > > >> >>
> > > >> >> I think such a comparison will help not only evaluate the overall
> > > >> accuracy,
> > > >> >> but also highlight the issues with the existing models (as a
> matter
> > > of
> > > >> >> fact, the existing models fail to recognize many of the hashtags
> in
> > > >> tweets:
> > > >> >> the tokenizer splits them into the "#" symbol and a word that the
> > PoS
> > > >> >> tagger also fails to recognize).
> > > >> >>
> > > >> >> Therefore, building Twitter-based models would also be useful,
> > since
> > > >> many
> > > >> >> of the works in academia / industry are focusing on Twitter data.
> > > >> >>
> > > >> >> Best regards,
> > > >> >>
> > > >> >> Mondher
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> On Tue, Jun 21, 2016 at 12:45 AM, Jason Baldridge <
> > > >> jasonbaldri...@gmail.com>
> > > >> >> wrote:
> > > >> >>
> > > >> >> > It would be fantastic to have these numbers. This is an example
> > of
> > > >> >> > something that would be a great contribution by someone trying
> to
> > > >> >> > contribute to open source and who is maybe just getting into
> > > machine
> > > >> >> > learning and natural language processing.
> > > >> >> >
> > > >> >> > For Twitter-ish text, it'd be great to look at models trained
> and
> > > >> evaluated
> > > >> >> > on the Tweet NLP resources:
> > > >> >> >
> > > >> >> > http://www.cs.cmu.edu/~ark/TweetNLP/
> > > >> >> >
> > > >> >> > And comparing to how their models performed, etc. Also, it's
> > worth
> > > >> looking
> > > >> >> > at spaCy (Python NLP modules) for further comparisons.
> > > >> >> >
> > > >> >> > https://spacy.io/
> > > >> >> >
> > > >> >> > -Jason
> > > >> >> >
> > > >> >> > On Mon, 20 Jun 2016 at 10:41 Jeffrey Zemerick <
> > > jzemer...@apache.org>
> > > >> >> > wrote:
> > > >> >> >
> > > >> >> > > I saw the same question on the users list on June 17. At
> least
> > I
> > > >> thought
> > > >> >> > it
> > > >> >> > > was the same question -- sorry if it wasn't.
> > > >> >> > >
> > > >> >> > > On Mon, Jun 20, 2016 at 11:37 AM, Mattmann, Chris A (3980) <
> > > >> >> > > chris.a.mattm...@jpl.nasa.gov> wrote:
> > > >> >> > >
> > > >> >> > > > Well, hold on. He sent that mail (as of the time of this
> > mail)
> > > 4
> > > >> >> > > > mins previously. Maybe some folks need some time to reply
> ^_^
> > > >> >> > > >
> > > >> >> > > >
> > > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >> >> > > > Chris Mattmann, Ph.D.
> > > >> >> > > > Chief Architect
> > > >> >> > > > Instrument Software and Science Data Systems Section (398)
> > > >> >> > > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > >> >> > > > Office: 168-519, Mailstop: 168-527
> > > >> >> > > > Email: chris.a.mattm...@nasa.gov
> > > >> >> > > > WWW:  http://sunset.usc.edu/~mattmann/
> > > >> >> > > >
> > > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >> >> > > > Director, Information Retrieval and Data Science Group
> (IRDS)
> > > >> >> > > > Adjunct Associate Professor, Computer Science Department
> > > >> >> > > > University of Southern California, Los Angeles, CA 90089
> USA
> > > >> >> > > > WWW: http://irds.usc.edu/
> > > >> >> > > >
> > > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > > On 6/20/16, 8:23 AM, "Jeffrey Zemerick" <
> > jzemer...@apache.org>
> > > >> wrote:
> > > >> >> > > >
> > > >> >> > > > >Hi Mondher,
> > > >> >> > > > >
> > > >> >> > > > >Since you didn't get any replies I'm guessing no one is
> > aware
> > > >> of any
> > > >> >> > > > >resources related to what you need. Google Scholar is a
> good
> > > >> place to
> > > >> >> > > look
> > > >> >> > > > >for papers referencing OpenNLP and its methods (in case
> you
> > > >> haven't
> > > >> >> > > > >searched it already).
> > > >> >> > > > >
> > > >> >> > > > >Jeff
> > > >> >> > > > >
> > > >> >> > > > >On Mon, Jun 20, 2016 at 11:19 AM, Mondher Bouazizi <
> > > >> >> > > > >mondher.bouaz...@gmail.com> wrote:
> > > >> >> > > > >
> > > >> >> > > > >> Hi,
> > > >> >> > > > >>
> > > >> >> > > > >> Apologies if you received multiple copies of this
> email. I
> > > >> sent it
> > > >> >> > to
> > > >> >> > > > the
> > > >> >> > > > >> users list a while ago, and haven't had an answer yet.
> > > >> >> > > > >>
> > > >> >> > > > >> I have been looking for a while if there is any relevant
> > > work
> > > >> that
> > > >> >> > > > >> performed tests on the OpenNLP tools (in particular the
> > > >> Lemmatizer,
> > > >> >> > > > >> Tokenizer and PoS-Tagger) when used with short and noisy
> > > >> texts such
> > > >> >> > as
> > > >> >> > > > >> Twitter data, etc., and/or compared it to other
> libraries.
> > > >> >> > > > >>
> > > >> >> > > > >> By performances, I mean accuracy/precision, rather than
> > time
> > > >> of
> > > >> >> > > > execution,
> > > >> >> > > > >> etc.
> > > >> >> > > > >>
> > > >> >> > > > >> If anyone can refer me to a paper or a work done in this
> > > >> context,
> > > >> >> > that
> > > >> >> > > > >> would be of great help.
> > > >> >> > > > >>
> > > >> >> > > > >> Thank you very much.
> > > >> >> > > > >>
> > > >> >> > > > >> Mondher
> > > >> >> > > > >>
> > > >> >> > > >
> > > >> >> > >
> > > >> >> >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>
