Please excuse the duplicate email; we could not attach the figure mentioned. Kindly find it here. Thank you.
From: anthonybeyler...@hotmail.com
To: dev@opennlp.apache.org
Subject: GSoC 2015 - WSD Module
Date: Mon, 18 May 2015 22:14:43 +0900

Dear all,

In the context of building a Word Sense Disambiguation (WSD) module, we surveyed WSD techniques and noted the following points:

- WSD techniques can be split into three sets: supervised, unsupervised/knowledge-based, and hybrid.
- WSD is used for several closely related objectives, such as all-words disambiguation, lexical sample disambiguation, and multi/cross-lingual approaches.
- Senseval/SemEval seem to be good references for comparing WSD techniques, since many techniques were tested on the same data (although the data differs from one event to the next).
- As a first solution, we propose to start by supporting the "lexical sample" type of disambiguation, that is, disambiguating a single word or a limited set of words from an input text.

Therefore, we have decided to collect information about the different techniques in the literature (such as references, performance, parameters, etc.) in this spreadsheet here. We have also collected the results of all the Senseval/SemEval exercises here. (Note that each document has many sheets.) The collected results could help decide which techniques to start with as the main models for each set of techniques (supervised/unsupervised).

We also propose a general approach for the package in the figure attached. The main components are as follows:

1- The different publicly available resources: WordNet, BabelNet, Wikipedia, etc. However, we would also like to allow users to use their own local resources, perhaps by defining a type of connector to the resource interface.
2- The resource interface, whose role is to provide both a sense inventory that the user can query and a knowledge base (semantic or syntactic information, etc.) that might be used depending on the technique. We might even later consider building a local cache for remote services.
3- The WSD algorithms/techniques themselves, which will use the resource interface to access the resources they require. These techniques will be split into two main packages, as on the left side of the figure: Supervised and Unsupervised. The utils package includes common tools used by both types of techniques. The details mentioned in each package should be common to all implementations of these abstract models.
4- I/O could be processed in different formats (XML, JSON, etc.) or in a simpler structure, following your recommendations.

If you have any suggestions or recommendations, we would really appreciate discussing them, and we would welcome your guidance as we iterate on this tool-set.

Best regards,
Anthony Beylerian, Mondher Bouazizi