RE: GSoC 2015 - WSD Module

2015-06-03 Thread Anthony Beylerian
Dear Jörn,

Thank you for the reply.
===
Yes, in the draft WSDisambiguator is the main interface.
===
Yes, for the disambiguate method the input is expected to be tokenized, so it 
should be an input array.
The second argument is the token index. We could also make it an index 
array to support multiple words.
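To make this concrete, here is a very rough sketch of what the revised 
signature could look like (all names here are placeholders, nothing is final):

// Rough sketch only: interface and parameter names are placeholders.
public interface WSDisambiguator {

    // Returns the candidate senses for the token at tokenIndex,
    // ordered by score (best first, or a single sense in the supervised case).
    String[] disambiguate(String[] tokenizedContext, int tokenIndex);

    // Possible variant for disambiguating several target tokens at once.
    String[][] disambiguate(String[] tokenizedContext, int[] tokenIndices);
}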
===
Concerning resources, we expect two types: local and remote.

+ For local resources, we have two main types:
1- training models for supervised techniques.
2- knowledge resources.

For #1, it would probably be best to package them like the existing OpenNLP 
models (rough usage sketch below).
As for #2, it will depend on what we want to use, since the type of 
information depends on the specific technique.
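
Something along these lines, mirroring how the existing models (e.g. 
TokenizerModel) are loaded and used; WSDModel and WSDisambiguatorME are 
placeholder names, not existing classes:

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class WSDUsageSketch {

    public static void main(String[] args) throws Exception {
        String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(
                "The bank can guarantee deposits will eventually cover future tuition costs .");

        // Hypothetical: a packaged WSD model loaded via the usual OpenNLP
        // serialization support, then used through the interface sketched above.
        try (InputStream in = new FileInputStream("en-wsd-supervised.bin")) {
            WSDModel model = new WSDModel(in);
            WSDisambiguator wsd = new WSDisambiguatorME(model);
            String[] senses = wsd.disambiguate(tokens, 1); // senses for "bank", best first
            System.out.println(senses.length > 0 ? senses[0] : "no sense found");
        }
    }
}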

+ As for remote resources, e.g. [BabelNet], [WordsAPI], etc., we might need to 
have some REST support, for example to retrieve a sense inventory for a certain 
word. Actually, the newest SemEval task [Semeval15] will use [BabelNet] for WSD 
and EL (Entity Linking). [BabelNet] has an offline version, but the newest one 
is only available through REST. Also, if we need to use a remote resource that 
requires a license, we would either use a license key or just use the free 
quota with no key.
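
Purely to illustrate the kind of REST support we mean, here is a rough sketch 
with plain HttpURLConnection; the endpoint, parameters, and key handling are 
placeholders and do not reflect the actual BabelNet or WordsAPI interfaces:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class RemoteSenseInventorySketch {

    // Fetches a sense inventory for a word from a hypothetical REST endpoint;
    // passing a null key would fall back to the free quota.
    public static String fetchSenseInventory(String word, String apiKey) throws Exception {
        String query = "https://example.org/senses?word=" + URLEncoder.encode(word, "UTF-8")
                + (apiKey != null ? "&key=" + URLEncoder.encode(apiKey, "UTF-8") : "");

        HttpURLConnection conn = (HttpURLConnection) new URL(query).openConnection();
        conn.setRequestMethod("GET");

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString(); // e.g. a JSON sense inventory for the word
        } finally {
            conn.disconnect();
        }
    }
}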

Therefore, we thought of having a [ResourceProvider] as mentioned in the 
[draft].
Are there any plans to add an external API connector of the sort, or is this 
functionality already possible as an extension?
(I noticed there is a [wikinews_importer] in the sandbox.)
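
As a starting point, the [ResourceProvider] could be as simple as the sketch 
below (all names tentative); a local implementation would read a packaged 
knowledge resource, while a remote one would wrap the REST access sketched 
above:

import java.util.HashMap;
import java.util.Map;

// Tentative sketch of the [ResourceProvider] idea from the draft.
public interface ResourceProvider {

    // Returns the candidate senses (the sense inventory) for a word,
    // ideally ordered by decreasing frequency.
    String[] getSenseInventory(String word);
}

// Trivial local provider backed by an in-memory map, standing in for a
// packaged knowledge resource.
class InMemoryResourceProvider implements ResourceProvider {

    private final Map<String, String[]> inventory = new HashMap<String, String[]>();

    void put(String word, String... senses) {
        inventory.put(word, senses);
    }

    public String[] getSenseInventory(String word) {
        String[] senses = inventory.get(word);
        return senses != null ? senses : new String[0];
    }
}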

But in any case, we can always start working locally as a first step. What do 
you think?
===
It would be more straightforward to use the algorithm names, so OK, why not.
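For instance (again, all names tentative), a trivial most-frequent-sense 
baseline could then be its own class, named after the approach:

// Tentative example: a "most frequent sense" (MFS) baseline as its own class,
// using the WSDisambiguator and ResourceProvider sketches above.
public class MFS implements WSDisambiguator {

    private final ResourceProvider provider;

    public MFS(ResourceProvider provider) {
        this.provider = provider;
    }

    public String[] disambiguate(String[] tokenizedContext, int tokenIndex) {
        // Assumes the provider lists senses by decreasing frequency, so the
        // most frequent sense is simply the first entry of the inventory.
        String[] senses = provider.getSenseInventory(tokenizedContext[tokenIndex]);
        return senses.length > 0 ? new String[] { senses[0] } : new String[0];
    }

    public String[][] disambiguate(String[] tokenizedContext, int[] tokenIndices) {
        String[][] results = new String[tokenIndices.length][];
        for (int i = 0; i < tokenIndices.length; i++) {
            results[i] = disambiguate(tokenizedContext, tokenIndices[i]);
        }
        return results;
    }
}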
===
Yes, we have already started working!
What do we need to push to the sandbox?
===

Thanks !

Anthony 

[BabelNet] : http://babelnet.org/download
[WordsAPI] : https://www.wordsapi.com/
[Semeval15] : http://alt.qcri.org/semeval2015/task13/
[draft] : 
https://docs.google.com/document/d/10FfAoavKQfQBAWF-frpfltcIPQg6GFrsoD1XmTuGsHM/edit?pli=1


 Subject: Re: GSoC 2015 - WSD Module
 From: kottm...@gmail.com
 To: dev@opennlp.apache.org
 Date: Mon, 1 Jun 2015 20:30:08 +0200
 
 Hello,
 
 I had a look at your APIs.
 
 Let's start with the WSDisambiguator. Should that be an interface?
 
 // returns the senses ordered by their score (best one first or only 1
 // in supervised case)
 String[] disambiguate(String inputText, int inputWordposition);
 
 Shouldn't we have a tokenized input? Or is the inputText a token?
 
 If you have resources you could package those into OpenNLP models and
 use the existing serialization support. Would that work for you?
 
 I think we should have different implementing classes for different
 algorithms rather than grouping that in the Supervised and Unsupervised
 classes. And also use the algorithm / approach name as part of the class
 name.
 
 As far as I understand you already started to work on this. Should we do an
 initial code drop into the sandbox, and then work things out from there?
 We strongly prefer to have as much of the source code editing history as
 possible in our version control system.
 
 Jörn 
  

Re: GSoC 2015 - WSD Module

2015-06-03 Thread Joern Kottmann
We should not use remote resources. A remote service adds severe limits to
the WSD component: a remote resource will be slow to query (compared to
disk or memory), queries might be expensive (pay per request), and the
license might not allow usage in the way the ASL promises to our users.
Another issue is that calling a remote service might leak the document text
itself to that remote service.

Please attach a patch to the JIRA issue, and then we can pull it into the
sandbox.

Jörn