We should not use remote resources. A remote service adds severe limits to the WSD component: it is slow to query compared to disk or memory, queries may be expensive (pay per request), and the license might not allow usage in the way the ASL promises to our users. Another issue is that calling a remote service might leak the document text itself to that service.
Please attach a patch to the JIRA issue, and then we can pull it into the sandbox.

Jörn

On Wed, Jun 3, 2015 at 1:34 PM, Anthony Beylerian <anthonybeyler...@hotmail.com> wrote:
> Dear Jörn,
>
> Thank you for the reply.
>
> ===================================
> Yes, in the draft WSDisambiguator is the main interface.
> ===================================
> Yes, for the disambiguate method the input is expected to be tokenized;
> it should be an input array.
> The second argument is for the token index. We can also make it an index
> array to support multiple words.
> ===================================
> Concerning the resources, we expect two types: local and remote
> resources.
>
> + For local resources, we have two main types:
> 1- training models for supervised techniques
> 2- knowledge resources
>
> It could be best to do the packaging using similar OpenNLP models for #1.
> As for #2, it will depend on what we want to use, since the type of
> information depends on the specific technique.
>
> + As for remote resources, e.g. [BabelNet], [WordsAPI], etc., we might
> need some REST support, for example to retrieve a sense inventory for a
> certain word. Actually, the newest SemEval task [Semeval15] will use
> [BabelNet] for WSD and EL (Entity Linking). [BabelNet] has an offline
> version, but the newest one is only available through REST. Also, in case
> a remote resource is needed AND it requires a license, we need to use a
> license key or just use the free quota with no key.
>
> Therefore, we thought of having a [ResourceProvider] as mentioned in the
> [draft].
> Are there any plans to add an external API connector of the sort, or is
> this functionality already possible as an extension?
> (I noticed there is a [wikinews_importer] in the sandbox.)
>
> But in any case, we can always start working only locally as a first
> step. What do you think?
> ===================================
> It would be more straightforward to use the algorithm names, so OK, why not.
> ===================================
> Yes, we have already started working!
> What do we need to push to the sandbox?
> ===================================
>
> Thanks!
>
> Anthony
>
> [BabelNet] : http://babelnet.org/download
> [WordsAPI] : https://www.wordsapi.com/
> [Semeval15] : http://alt.qcri.org/semeval2015/task13/
> [draft] : https://docs.google.com/document/d/10FfAoavKQfQBAWF-frpfltcIPQg6GFrsoD1XmTuGsHM/edit?pli=1
>
> > Subject: Re: GSoC 2015 - WSD Module
> > From: kottm...@gmail.com
> > To: dev@opennlp.apache.org
> > Date: Mon, 1 Jun 2015 20:30:08 +0200
> >
> > Hello,
> >
> > I had a look at your APIs.
> >
> > Let's start with the WSDisambiguator. Should that be an interface?
> >
> > // returns the senses ordered by their score (best one first, or only 1
> > // in the supervised case)
> > String[] disambiguate(String inputText, int inputWordPosition);
> >
> > Shouldn't we have a tokenized input? Or is the inputText a token?
> >
> > If you have resources, you could package them into OpenNLP models and
> > use the existing serialization support. Would that work for you?
> >
> > I think we should have different implementing classes for different
> > algorithms rather than grouping them in the Supervised and Unsupervised
> > classes, and also use the algorithm / approach name as part of the
> > class name.
> >
> > As far as I understand, you already started to work on this. Should we
> > do an initial code drop into the sandbox, and then work out things from
> > there? We strongly prefer to have as much source code editing history
> > as possible in our version control system.
> >
> > Jörn
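[Editor's note: the interface shape discussed in this thread (tokenized input plus a token index, senses returned best-first) can be sketched as below. The class and method names, the sense-key strings, and the first-sense baseline implementation are illustrative assumptions, not the committed OpenNLP API.]

```java
import java.util.HashMap;
import java.util.Map;

public class WSDSketch {

    // Sketch of the interface from the draft: tokenized context plus a
    // token index, returning sense identifiers ordered by score
    // (best first, or a single sense in the supervised case).
    public interface WSDisambiguator {
        String[] disambiguate(String[] tokenizedContext, int tokenIndex);
    }

    // Hypothetical baseline implementation: always returns the senses
    // listed for the surface form in a fixed inventory, ignoring context.
    public static class FirstSenseDisambiguator implements WSDisambiguator {
        private final Map<String, String[]> inventory;

        public FirstSenseDisambiguator(Map<String, String[]> inventory) {
            this.inventory = inventory;
        }

        @Override
        public String[] disambiguate(String[] tokenizedContext, int tokenIndex) {
            String word = tokenizedContext[tokenIndex].toLowerCase();
            return inventory.getOrDefault(word, new String[0]);
        }
    }

    public static void main(String[] args) {
        // Toy sense inventory; real keys would come from a resource
        // such as WordNet or BabelNet.
        Map<String, String[]> inventory = new HashMap<>();
        inventory.put("bank", new String[] {"bank%financial", "bank%river"});

        WSDisambiguator wsd = new FirstSenseDisambiguator(inventory);
        String[] senses = wsd.disambiguate(
                new String[] {"I", "went", "to", "the", "bank"}, 4);
        System.out.println(String.join(",", senses));
    }
}
```

The multi-word variant suggested in the thread would add an overload taking an `int[]` of token indices and returning `String[][]`, one sense ranking per target word.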