Hi, I attached an initial patch to OPENNLP-758. However, we are currently adjusting things a bit, since many approaches need to be supported, and we would like your recommendations. Here are some notes:
1. We used extJWNL.
2. [WSDisambiguator] is the main interface.
3. [Loader] loads the required resources.
4. Please check [FeaturesExtractor] for the methods mentioned by Rodrigo.
5. [Lesk] has many variants; we have already implemented some, but we are wondering about the preferred way to switch from one to another. For now we use one of them as the default, but we thought of either introducing a parameter list to fill in or making a separate class for each variant, or otherwise following your preference.
6. The other classes are for convenience.

We will try to submit patches frequently on the separate issues, following the feedback.

Best regards,
Anthony

> Date: Wed, 10 Jun 2015 11:42:56 +0200
> Subject: Re: GSoC 2015 - WSD Module
> From: kottm...@gmail.com
> To: dev@opennlp.apache.org
>
> You can attach the patch to one of the issues, or you can create a new issue.
> In the end it doesn't matter much; what is important is that we make progress
> here and get the initial code into our repository. Subsequent changes can
> then be done in a patch series.
>
> Please try to submit the patch as quickly as possible.
>
> Jörn
>
> On Mon, Jun 8, 2015 at 4:54 PM, Rodrigo Agerri <rage...@apache.org> wrote:
>
> > Hello,
> >
> > On Mon, Jun 8, 2015 at 3:49 PM, Mondher Bouazizi
> > <mondher.bouaz...@gmail.com> wrote:
> > > Dear Rodrigo,
> > >
> > > As Anthony mentioned in his previous email, I already started the
> > > implementation of the IMS approach. The pre-processing and the extraction
> > > of features have already been finished. Regarding the approach itself, it
> > > shows some potential according to the author, though the features proposed
> > > are not that many, and are basic.
> >
> > Hi, yes, the features are not that complex, but it is good to have a
> > working system, and then, if needed, the feature set can be
> > improved/enriched.
> > As stated in the paper, the IMS approach leverages parallel data to
> > obtain state-of-the-art results in both the lexical sample and all-words
> > tasks on the Senseval-3 and SemEval-2007 datasets.
> >
> > I think it will be nice to have a working system with this algorithm
> > as part of the WSD component in OpenNLP (following the API discussion
> > earlier in this thread) and perform some evaluations to know where the
> > system stands with respect to state-of-the-art results on those
> > datasets. Once this is operative, I think it will be a good moment to
> > start discussing additional/better features.
> >
> > > I think the approach itself might be enhanced if we add more
> > > context-specific features from some other approaches... (To do that,
> > > I need to run many experiments using different combinations of
> > > features; however, that should not be a problem.)
> >
> > Speaking about the feature sets: in the API Google doc I have not seen
> > anything about the implementation of the feature extractors. Could you
> > perhaps provide some extra info (in that same document, for example)
> > about that?
> >
> > > But the approach itself requires a linear SVM classifier, and as far
> > > as I know, OpenNLP has only a Maximum Entropy classifier. Is it OK to
> > > use libsvm?
> >
> > I think you can try with a MaxEnt to start with, and in the meantime,
> > Jörn has commented sometimes that there is a plugin component in
> > OpenNLP to use third-party ML libraries and that he tested it with
> > Mallet. Perhaps he could comment on using that functionality to get
> > SVMs.
> >
> > > Regarding the training data, I started collecting some from different
> > > sources. Most of the existing rich corpora are licensed (including
> > > the ones mentioned in the paper). The free ones I have for now are
> > > from the Senseval and SemEval websites. However, these are used just
> > > to evaluate the methods proposed in the workshops.
> > > Therefore, the words to disambiguate are few in number, though the
> > > training data for each word are rich enough.
> > >
> > > In any case, the first tests with the collected Senseval and SemEval
> > > data should be finished soon. However, I am not sure there is a rich
> > > enough dataset we can use to build our model for the WSD module in
> > > the OpenNLP library. If you have any recommendation, I would be
> > > grateful if you could help me on this point.
> >
> > Well, as I said in my previous email, research around "word senses" is
> > moving from WSD towards supersense tagging, where there are recent
> > papers and freely available tweet datasets, for example. In any case,
> > we can look more into it, but in the meantime SemCor for training and
> > the Senseval/SemEval-2007 datasets for evaluation should be enough to
> > compare your system with the literature.
> >
> > > As Jörn mentioned sending an initial patch, should we separate our
> > > code and upload two different patches to the two issues we created on
> > > the Jira (however, this means a lot of redundancy in the code), or
> > > shall we keep it in one project and upload that? If we opt for the
> > > latter, which issue should we upload the patch to?
> >
> > In my opinion, it should be the same patch and the same component, with
> > the different algorithm implementations within it. Any other opinions?
> >
> > Cheers,
> >
> > Rodrigo
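
P.S. To make point 5 of the notes more concrete, here is a minimal sketch of the "parameter list" option: a single Lesk class whose behaviour is selected by a small parameter object, rather than one class per variant. All names here (LeskSketch, LeskVariant, LeskParameters, overlap) are hypothetical illustrations, not the classes in the attached patch; the overlap count is just the gloss/context token intersection common to the Lesk variants.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the "parameter list" option for switching
// between Lesk variants. None of these names exist in the actual patch.
public class LeskSketch {

    enum LeskVariant { ORIGINAL, SIMPLIFIED, EXTENDED }

    // Parameter object the caller fills in; the default variant is used
    // when nothing is set, mirroring the current behaviour of the patch.
    static class LeskParameters {
        LeskVariant variant = LeskVariant.SIMPLIFIED; // default variant
        int windowSize = 4;                           // context words per side
    }

    // Core operation shared by the variants: count overlapping tokens
    // between a sense gloss and the target word's context window.
    static int overlap(Set<String> gloss, Set<String> context) {
        Set<String> common = new HashSet<>(gloss);
        common.retainAll(context);
        return common.size();
    }

    public static void main(String[] args) {
        LeskParameters params = new LeskParameters();
        params.variant = LeskVariant.EXTENDED; // caller picks a variant
        Set<String> gloss = new HashSet<>(Arrays.asList("financial", "institution", "money"));
        Set<String> context = new HashSet<>(Arrays.asList("deposit", "money", "institution"));
        System.out.println(overlap(gloss, context)); // prints 2
    }
}
```

The alternative design (separate classes per variant) would replace the enum with subclasses of a common abstract Lesk base; the parameter-object form keeps the public API surface to a single class, which may be simpler for command-line and API users.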