Dear Jörn,

Thank you for that.

After further surveying, I am thinking of implementing an approach based on context clustering as the next step, possibly similar to the one in [1], which relies on a public (CC-A licensed) dataset [2].

Clustering is usually done with K-means, which can take some time on large data. This has already been done, however, and the results were made publicly available in [3], with up to 20 closest clusters per "phrase". The authors of [1] propose to subsequently apply a Naive Bayes classifier, as described in their paper.

I believe this is straightforward enough to implement as another unsupervised approach within the proposed time frame. I would like your opinion.

Regards,
Anthony

[1] http://nlp.cs.rpi.edu/paper/wsd.pdf
[2] http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
[3] http://webdocs.cs.ualberta.ca/~bergsma/PhrasalClusters/

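For illustration only, here is a rough sketch of how cluster IDs such as those published in [3] might be fed to a Naive Bayes classifier. It is not taken from [1]: the class name ClusterNaiveBayes, the cluster IDs ("c3", "c12", ...), the sense labels, and the add-one smoothing are all assumptions made for this example. It also expects sense-labelled examples for training, which the paper may obtain differently.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: multinomial Naive Bayes over cluster-ID features. */
final class ClusterNaiveBayes {
    // sense -> number of training examples seen with that sense
    private final Map<String, Integer> senseCounts = new HashMap<>();
    // sense -> (cluster ID -> occurrence count)
    private final Map<String, Map<String, Integer>> featureCounts = new HashMap<>();
    // sense -> total number of cluster-ID occurrences
    private final Map<String, Integer> featureTotals = new HashMap<>();
    // all cluster IDs seen during training
    private final Set<String> vocabulary = new HashSet<>();
    private int totalExamples = 0;

    /** Adds one training example: a sense and the cluster IDs of its context. */
    void train(String sense, List<String> clusterIds) {
        senseCounts.merge(sense, 1, Integer::sum);
        totalExamples++;
        Map<String, Integer> counts =
            featureCounts.computeIfAbsent(sense, s -> new HashMap<>());
        for (String id : clusterIds) {
            counts.merge(id, 1, Integer::sum);
            featureTotals.merge(sense, 1, Integer::sum);
            vocabulary.add(id);
        }
    }

    /** Returns the most probable sense, or null if nothing was trained. */
    String predict(List<String> clusterIds) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        // One extra slot in the denominator is reserved for unseen cluster IDs.
        int smoothingSlots = vocabulary.size() + 1;
        for (Map.Entry<String, Integer> entry : senseCounts.entrySet()) {
            String sense = entry.getKey();
            // Log prior: fraction of training examples with this sense.
            double score = Math.log((double) entry.getValue() / totalExamples);
            Map<String, Integer> counts = featureCounts.get(sense);
            int total = featureTotals.getOrDefault(sense, 0);
            for (String id : clusterIds) {
                int count = counts.getOrDefault(id, 0);
                // Add-one (Laplace) smoothed log likelihood.
                score += Math.log((count + 1.0) / (total + smoothingSlots));
            }
            if (score > bestScore) {
                bestScore = score;
                best = sense;
            }
        }
        return best;
    }
}
```

Working in log space avoids underflow when a context has many cluster features, and the smoothing keeps a cluster ID that never occurred with a given sense from zeroing out that sense entirely.
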
> Date: Fri, 19 Jun 2015 16:41:20 +0200
> Subject: Re: GSoC 2015 - WSD Module
> From: [email protected]
> To: [email protected]
>
> Hello,
>
> I will dedicate time tonight to get this pulled in the sandbox and will
> then also provide some feedback.
> We can then create new patches against the sandbox to fix further issues.
>
> Jörn
>
> On Fri, Jun 19, 2015 at 11:02 AM, Anthony Beylerian <
> [email protected]> wrote:
>
> > Thank you for the reply, I am guessing for now we will use the other
> > sources.
> >
> > By the way, I have uploaded a newer patch on the same issue [1].
> > Would like to know if the approach to set parameters is acceptable.
> >
> > Also, we are referencing to some model files locally like tokenizer,
> > tagger, etc because we need them for the preprocessing chain. For example:
> >
> > ++++++++++++++++++++++
> > private static String modelsDir =
> >     "src\\test\\resources\\opennlp\\tools\\disambiguator\\";
> >
> > TokenizerModel tokenizerModel = new TokenizerModel(
> >     new FileInputStream(modelsDir + "en-token.bin"));
> > tokenizer = new TokenizerME(tokenizerModel);
> > ++++++++++++++++++++++
> >
> > Thought of adding these files (.bin) in the test folder, but could anyone
> > recommend a more elegant way to do this?
> > Thanks!
> >
> > Anthony
> >
> > [1] : https://issues.apache.org/jira/browse/OPENNLP-758
> >
> >
> > > From: [email protected]
> > > Date: Fri, 19 Jun 2015 10:18:12 +0200
> > > Subject: Re: GSoC 2015 - WSD Module
> > > To: [email protected]
> > >
> > > Thanks for the update and the updated patch.
> > >
> > > With respect to the licensing of BabelNet, I do not think we can
> > > redistribute CC BY-NC-SA resources here, but others in this project
> > > and Apache in general will probably know better than me.
> > >
> > > Best,
> > >
> > > Rodrigo
