Please open a Jira issue for this, and for the other GSoC tasks. I would like to use Jira to plan the outstanding tasks.
Are you working on this currently?

Jörn

On Mon, 2015-06-22 at 00:55 +0900, Anthony Beylerian wrote:
> Dear Jörn,
>
> Thank you for that.
>
> After further surveying, I was thinking of beginning the implementation of an
> approach based on context clustering as a next step.
> Maybe similar to the one in [1] which relies on a public (CC-A licensed)
> dataset [2]. Since clustering is usually done using K-means, which could take
> some time with large data, this was already done previously and the results
> were made publicly available in [3] with up to 20 closest clusters per
> "phrase".
> The authors in [1] propose to subsequently apply a Naive Bayes classifier as
> described in their paper. I believe this is straight-forward enough to
> implement as another unsupervised approach for the proposed time-frame.
> Would like your opinion.
>
> Regards,
>
> Anthony
>
> [1] http://nlp.cs.rpi.edu/paper/wsd.pdf
> [2] http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
> [3] http://webdocs.cs.ualberta.ca/~bergsma/PhrasalClusters/
>
> > Date: Fri, 19 Jun 2015 16:41:20 +0200
> > Subject: Re: GSoC 2015 - WSD Module
> > From: kottm...@gmail.com
> > To: dev@opennlp.apache.org
> >
> > Hello,
> >
> > I will dedicate time tonight to get this pulled in the sandbox and will
> > then also provide some feedback.
> > We can then create new patches against the sandbox to fix further issues.
> >
> > Jörn
> >
> > On Fri, Jun 19, 2015 at 11:02 AM, Anthony Beylerian <
> > anthonybeyler...@hotmail.com> wrote:
> >
> > > Thank you for the reply, I am guessing for now we will use the other
> > > sources.
> > >
> > > By the way, I have uploaded a newer patch on the same issue [1].
> > > Would like to know if the approach to set parameters is acceptable.
> > >
> > > Also, we are referencing some model files locally (tokenizer, tagger,
> > > etc.) because we need them for the preprocessing chain. For example:
> > >
> > > ++++++++++++++++++++++
> > > private static String modelsDir =
> > >     "src\\test\\resources\\opennlp\\tools\\disambiguator\\";
> > >
> > > TokenizerModel tokenizerModel = new TokenizerModel(
> > >     new FileInputStream(modelsDir + "en-token.bin"));
> > > tokenizer = new TokenizerME(tokenizerModel);
> > > ++++++++++++++++++++++
> > >
> > > Thought of adding these files (.bin) in the test folder, but could anyone
> > > recommend a more elegant way to do this?
> > > Thanks!
> > >
> > > Anthony
> > >
> > > [1] : https://issues.apache.org/jira/browse/OPENNLP-758
> > >
> > > > From: rage...@apache.org
> > > > Date: Fri, 19 Jun 2015 10:18:12 +0200
> > > > Subject: Re: GSoC 2015 - WSD Module
> > > > To: dev@opennlp.apache.org
> > > >
> > > > Thanks for the update and the updated patch.
> > > >
> > > > With respect to the licensing of BabelNet, I do not think we can
> > > > redistribute CC BY-NC-SA resources here, but others in this project
> > > > and Apache in general will probably know better than me.
> > > >
> > > > Best,
> > > >
> > > > Rodrigo
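
On the model-file question quoted above: one possible direction, offered only as a minimal sketch, is to resolve the .bin files from the classpath and fall back to a path built from segments, so that no Windows-style "\\" separators are hardcoded. The class name ModelLocator and its methods are hypothetical and not part of OpenNLP; the sketch assumes the models sit under src/test/resources/opennlp/tools/disambiguator/, as in the quoted snippet, and that the build puts that directory on the test classpath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not an OpenNLP class): locates model files without
// hardcoding platform-specific path separators.
final class ModelLocator {

  // Assumed classpath location of the models, mirroring the quoted snippet.
  static final String RESOURCE_DIR = "/opennlp/tools/disambiguator/";

  // Fallback directory built from segments, so the separator is whatever
  // the current platform uses.
  static final Path FALLBACK_DIR =
      Paths.get("src", "test", "resources", "opennlp", "tools", "disambiguator");

  private ModelLocator() {}

  // Absolute classpath name for a model file such as "en-token.bin".
  static String resourceName(String fileName) {
    return RESOURCE_DIR + fileName;
  }

  // File-system path used when the model is not on the classpath.
  static Path fallbackPath(String fileName) {
    return FALLBACK_DIR.resolve(fileName);
  }

  // Tries the classpath first, then the file system. Throws an IOException
  // (NoSuchFileException) if the model is in neither place.
  static InputStream open(String fileName) throws IOException {
    InputStream in = ModelLocator.class.getResourceAsStream(resourceName(fileName));
    if (in != null) {
      return in;
    }
    return Files.newInputStream(fallbackPath(fileName));
  }
}
```

With a helper like this, the quoted snippet could pass ModelLocator.open("en-token.bin") to the TokenizerModel constructor, ideally inside a try-with-resources block so the stream is closed. Whether the .bin files should be checked in at all is a separate question that this sketch does not settle.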