We should not use remote resources. A remote service places severe limits on
the WSD component: it will be slow to query (compared to disk or memory),
queries might be expensive (pay per request), and the license might not allow
usage in the way the ASL promises to our users. Another issue is that calling
a remote service might leak the document text itself to that remote service.
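To make the interface shape concrete, here is a minimal Java sketch of the disambiguate signature discussed below in this thread (a token array plus a token index); the class names and sense strings are illustrative only, not actual OpenNLP API:

```java
// A minimal sketch of the WSDisambiguator shape discussed in this thread:
// tokenized input plus a token index (an index array could be added later
// to support multiple words). All names here are hypothetical.
public class WSDSketch {

    interface WSDisambiguator {
        // Returns candidate senses ordered by score, best first
        // (or a single sense in the supervised case).
        String[] disambiguate(String[] tokens, int tokenIndex);
    }

    // Trivial stand-in that returns hard-coded senses for one word,
    // just to show the calling convention.
    static class DummyDisambiguator implements WSDisambiguator {
        @Override
        public String[] disambiguate(String[] tokens, int tokenIndex) {
            if ("bank".equals(tokens[tokenIndex])) {
                return new String[] {"bank%1:14:00::", "bank%1:17:01::"};
            }
            return new String[] {tokens[tokenIndex] + "#1"};
        }
    }

    public static void main(String[] args) {
        WSDisambiguator wsd = new DummyDisambiguator();
        String[] senses = wsd.disambiguate(
                new String[] {"I", "went", "to", "the", "bank"}, 4);
        System.out.println(senses[0]);
    }
}
```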

Please attach a patch to the JIRA issue, and then we can pull it into the
sandbox.

Jörn





On Wed, Jun 3, 2015 at 1:34 PM, Anthony Beylerian <
anthonybeyler...@hotmail.com> wrote:

> Dear Jörn,
>
> Thank you for the reply.
> ===================================
> Yes, in the draft WSDisambiguator is the main interface.
> ===================================
> Yes, for the disambiguate method the input is expected to be tokenized;
> it should be an input array.
> The second argument is for the token index. We can also make it into an
> index array to support multiple words.
> ===================================
> Concerning the resources, we expect two types: local and remote.
>
> + For local resources, we have two main types:
> 1- training models for supervised techniques.
> 2- knowledge resources
>
> It would probably be best to package #1 using the existing OpenNLP model
> format. As for #2, it will depend on what we want to use, since the type of
> information depends on the specific technique.
>
> + As for remote resources, e.g. [BabelNet], [WordsAPI], etc., we might need
> to have some REST support, for example to retrieve a sense inventory for a
> certain word. Actually, the newest SemEval task [Semeval15] will use
> [BabelNet] for WSD and EL (Entity Linking). [BabelNet] has an offline
> version, but the newest one is only available through REST. Also, if we
> need to use a remote resource and it requires a license, we would need to
> use a license key or just use the free quota with no key.
>
> Therefore, we thought of having a [ResourceProvider] as mentioned in the
> [draft].
> Are there any plans to add an external API connector of this sort, or is
> this functionality already possible through extension?
> (I noticed there is a [wikinews_importer] in the sandbox.)
>
> But in any case, we can always start working only locally as a first step.
> What do you think?
> ===================================
> It would be more straightforward to use the algorithm names, so OK, why not.
> ===================================
> Yes we have already started working !
> What do we need to push to the sandbox ?
> ===================================
>
> Thanks !
>
> Anthony
>
> [BabelNet] : http://babelnet.org/download
> [WordsAPI] : https://www.wordsapi.com/
> [Semeval15] : http://alt.qcri.org/semeval2015/task13/
> [draft] :
> https://docs.google.com/document/d/10FfAoavKQfQBAWF-frpfltcIPQg6GFrsoD1XmTuGsHM/edit?pli=1
>
>
> > Subject: Re: GSoC 2015 - WSD Module
> > From: kottm...@gmail.com
> > To: dev@opennlp.apache.org
> > Date: Mon, 1 Jun 2015 20:30:08 +0200
> >
> > Hello,
> >
> > I had a look at your APIs.
> >
> > Let's start with the WSDisambiguator. Should that be an interface?
> >
> > // returns the senses ordered by their score (best one first or only 1
> > in supervised case)
> > String[] disambiguate(String inputText,int inputWordposition);
> >
> > Shouldn't we have a tokenized input? Or is the inputText a token?
> >
> > If you have resources you could package those into OpenNLP models and
> > use the existing serialization support. Would that work for you?
> >
> > I think we should have different implementing classes for different
> > algorithms rather than grouping that in the Supervised and Unsupervised
> > classes. And also use the algorithm / approach name as part of the class
> > name.
> >
> > As far as I understand, you already started to work on this. Should we do
> > an initial code drop into the sandbox, and then work things out from
> > there? We strongly prefer to have as much of the source code editing
> > history as possible in our version control system.
> >
> > Jörn
>
>
