Hey Cristian,

We have tried different approaches such as:

- Lesk (original) [1]
- Most frequent sense from the data (MFS)
- Extended Lesk (with different scoring functions)
- It Makes Sense (IMS) [2]
- A sense clustering approach (I don't immediately recall the reference)

Lesk and MFS are meant to be used as baselines for evaluation purposes only.

The extended version of Lesk is an effort to improve on the original by using additional information from semantic relationships. Although it's not very accurate, it could be useful since it is an unsupervised method (no need for large training data). There were some caveats, though, as both approaches need to pre-load dictionaries as well as score a semantic graph from WordNet at runtime.

IMS is a supervised method that we were hoping to use as the main approach, since it scored around 80% accuracy on SemEval. That figure is only for the coarse-grained case, though. In reality, words have varying degrees of polysemy, and when tested on the fine-grained case the results were much lower.

We have also experimented with a simple clustering approach, but the improvements were not considerable as far as I remember.

I just checked the latest results from SemEval-2015 [3], and they look somewhat better for the fine-grained case (~65% F1). In some particular domains the accuracy looks higher, so it could depend on the use case. There could also be more recent studies that yield better results, but that would need some more investigation.

There are also some other issues, such as the lack of direct multilingual support in WordNet, missing sense definitions, etc. We were also still looking for a better source of sense definitions back then.

In any case, I believe it would be better to have higher performance before putting this in the official distribution, although that ultimately depends on the team. Apart from that, different parts of the code just need some simple refactoring.

Best,

Anthony

[1]: M. Lesk, Automatic sense disambiguation using machine readable dictionaries
[2]: https://www.comp.nus.edu.sg/~nght/pubs/ims.pdf
[3]: http://alt.qcri.org/semeval2015/task13/index.php?id=results

On Wed, Feb 21, 2018 at 5:26 AM, Cristian Petroaca <cristian.petro...@gmail.com> wrote:

> Hi Anthony,
>
> I'd be interested to discuss this further.
> What are the wsd methods used? Any links to papers?
> How does the module perform when being evaluated against Senseval?
>
> How much work do you think it's necessary in order to have a functioning
> WSD module in the context of OpenNLP?
>
> Thanks,
> Cristian
>
> On Tue, Feb 20, 2018 at 8:09 AM, Anthony Beylerian <anthony.beyler...@gmail.com> wrote:
>
>> Hi Cristian,
>>
>> Thank you for your interest.
>>
>> The WSD module is currently experimental, so as far as I am aware there
>> is no timeline for it.
>>
>> You can find the sandboxed version here:
>> https://github.com/apache/opennlp-sandbox/tree/master/opennlp-wsd
>>
>> I personally didn't have the time to revisit this for a while and there
>> are still some details to work out.
>> But if you are really interested, you are welcome to discuss and
>> contribute.
>> I will assist as much as possible.
>>
>> Best,
>>
>> Anthony
>>
>> On Sun, Feb 18, 2018 at 5:52 AM, Cristian Petroaca <cristian.petro...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm interested in word sense disambiguation (particularly based on
>>> Wordnet). I noticed that the latest OpenNLP version doesn't have any but I
>>> remember that a couple of years ago there was somebody working on
>>> implementing it. Why isn't it in the official OpenNLP jar? Is there a
>>> timeline for adding it?
>>>
>>> Thanks,
>>> Cristian
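
For readers unfamiliar with the Lesk baseline [1] mentioned at the top of this thread, here is a rough sketch of the basic idea: pick the sense whose dictionary gloss shares the most words with the surrounding context. This is only a minimal illustration and is not the code in opennlp-wsd. The class name `SimplifiedLesk`, its methods, the stop-word list, and the two-sense inventory for "bank" are all hypothetical and made up for the example; a real setup would take glosses from WordNet or another sense inventory.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy Lesk-style disambiguator (hypothetical; not the opennlp-wsd implementation). */
final class SimplifiedLesk {

    // Tiny illustrative stop-word list so function words do not dominate the overlap.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "of", "in", "on", "at", "to", "and", "or", "that", "is", "i", "my"));

    /** Lower-cases the text, splits on non-letters, and drops stop words. */
    static Set<String> contentWords(String text) {
        Set<String> words = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                words.add(token);
            }
        }
        return words;
    }

    /** Number of distinct content words shared by the gloss and the context. */
    static int overlap(String gloss, String context) {
        Set<String> shared = contentWords(gloss);
        shared.retainAll(contentWords(context));
        return shared.size();
    }

    /**
     * Returns the id of the sense with the highest gloss/context overlap,
     * or null if the inventory is empty. Ties keep the earliest sense in
     * iteration order, so listing the most frequent sense first gives an
     * MFS-style fallback.
     */
    static String disambiguate(Map<String, String> glossesBySenseId, String context) {
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, String> sense : glossesBySenseId.entrySet()) {
            int score = overlap(sense.getValue(), context);
            if (score > bestScore) {
                bestScore = score;
                best = sense.getKey();
            }
        }
        return best;
    }

    /** Hypothetical two-sense inventory for "bank"; the glosses are not taken from WordNet. */
    static Map<String, String> toyBankSenses() {
        Map<String, String> senses = new LinkedHashMap<>();
        senses.put("bank#finance", "a financial institution that accepts deposits and lends money");
        senses.put("bank#river", "sloping land beside a body of water such as a river");
        return senses;
    }
}
```

Calling `SimplifiedLesk.disambiguate(SimplifiedLesk.toyBankSenses(), "I sat on the bank of the river and watched the water")` selects `bank#river`, because that gloss shares "river" and "water" with the context while the financial gloss shares nothing. The sketch also hints at why plain Lesk is weak: "deposit" in a context would not match "deposits" in the gloss, since there is no stemming or use of related synsets, which is the kind of gap the extended variant tries to close.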