Re: GSoC 2015 - WSD Module

2015-06-30 Thread Joern Kottmann
Can you please open some jira issues so we can better keep track of what
has to be done?

Jörn
On Jun 28, 2015 10:23 PM, Joern Kottmann kottm...@gmail.com wrote:

 Yes, the performance testing has to be there, otherwise it is hard to
 tell if it works or not.

 Jörn

 On Mon, 2015-06-29 at 02:02 +0900, Anthony Beylerian wrote:
  Dear Jörn,
 
  As a first milestone, for now we have the main interface with two
 implementations (one unsupervised, one supervised); maybe we can add an
 evaluator for performance tests and comparison with the test data we
 currently have (SemEval, SensEval test sets).
 
  Best,
 
  Anthony
 
   Subject: Re: GSoC 2015 - WSD Module
   From: kottm...@gmail.com
   To: dev@opennlp.apache.org
   Date: Thu, 25 Jun 2015 21:47:22 +0200
  
   On Wed, 2015-06-10 at 22:13 +0900, Anthony Beylerian wrote:
Hi,
   
I attached an initial patch to OPENNLP-758.
However, we are currently modifying things a bit since many
 approaches need to be supported, but would like your recommendations.
Here are some notes :
   
 1- We used extJWNL
 2- [WSDisambiguator] is the main interface
 3- [Loader] loads the resources required
 4- Please check [FeaturesExtractor] for the methods mentioned by Rodrigo.
 5- [Lesk] has many variants; we already implemented some, but we are
 wondering about the preferred way to switch from one to the other:
 as of now we use one of them as the default, but we thought of either
 making a parameter list to fill or making separate classes for each, or
 otherwise following your preference.
 6- The other classes are for convenience.
   
We will try to patch frequently on the separate issues, following
 the feedback.
  
  
   Sounds good, I reviewed it and think what we have is quite ok.
  
   Most important now is to fix the smaller issues (see the jira issue)
 and
   explain to us how it can be run.
  
   The midterm evaluation is coming up next week as well.
  
   How are we standing with the milestone we set?
  
   Jörn
  
 




Re: GSoC 2015 - WSD Module

2015-06-28 Thread Joern Kottmann
Yes, the performance testing has to be there, otherwise it is hard to
tell if it works or not.

Jörn

On Mon, 2015-06-29 at 02:02 +0900, Anthony Beylerian wrote:
 Dear Jörn,
 
 As a first milestone, for now we have the main interface with two 
 implementations (one unsupervised, one supervised); maybe we can add an 
 evaluator for performance tests and comparison with the test data we 
 currently have (SemEval, SensEval test sets).
 
 Best,
 
 Anthony
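
For illustration, a minimal sketch of what such an evaluator might look like
(hypothetical names, not the actual patch; it only assumes gold-standard
senses from the SemEval/SensEval test sets are available per sample):

++
// Sketch of a simple WSD evaluator: compare predicted senses against
// gold-standard senses and report accuracy.
public class WSDEvaluatorSketch {

  private int correct;
  private int total;

  // 'predicted' is the best sense returned by a disambiguator,
  // 'gold' is the reference sense from the test set.
  public void evaluateSample(String predicted, String gold) {
    if (predicted != null && predicted.equals(gold)) {
      correct++;
    }
    total++;
  }

  // Accuracy = correctly disambiguated samples / all samples.
  public double getAccuracy() {
    return total == 0 ? 0.0d : (double) correct / total;
  }
}
++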
 
  Subject: Re: GSoC 2015 - WSD Module
  From: kottm...@gmail.com
  To: dev@opennlp.apache.org
  Date: Thu, 25 Jun 2015 21:47:22 +0200
  
  On Wed, 2015-06-10 at 22:13 +0900, Anthony Beylerian wrote:
   Hi,
   
   I attached an initial patch to OPENNLP-758.
   However, we are currently modifying things a bit since many approaches 
   need to be supported, but would like your recommendations.
   Here are some notes : 
   
   1- We used extJWNL
   2- [WSDisambiguator] is the main interface
   3- [Loader] loads the resources required
   4- Please check [FeaturesExtractor] for the methods mentioned by Rodrigo.
   5- [Lesk] has many variants; we already implemented some, but we are wondering 
   about the preferred way to switch from one to the other:
   as of now we use one of them as the default, but we thought of either making 
   a parameter list to fill or making separate classes for each, or otherwise 
   following your preference.
   6- The other classes are for convenience.
   
   We will try to patch frequently on the separate issues, following the 
   feedback.
  
  
  Sounds good, I reviewed it and think what we have is quite ok.
  
  Most important now is to fix the smaller issues (see the jira issue) and
  explain to us how it can be run.
  
  The midterm evaluation is coming up next week as well.
  
  How are we standing with the milestone we set?
  
  Jörn
  
 





Re: GSoC 2015 - WSD Module

2015-06-25 Thread Joern Kottmann
On Mon, 2015-06-22 at 00:55 +0900, Anthony Beylerian wrote:
 Dear Jörn,
 Thank you for that.
 
 After further surveying, I was thinking of beginning the implementation of an 
 approach based on context clustering as a next step.
 Maybe similar to the one in [1], which relies on a public (CC-A licensed) 
 dataset [2]. Since clustering is usually done using K-means, which could take 
 some time with large data, this was already done previously and the results 
 were made publicly available in [3], with up to 20 closest clusters per 
 phrase.
 The authors in [1] propose to subsequently apply a Naive Bayes classifier as 
 described in their paper. I believe this is straightforward enough to 
 implement as another unsupervised approach within the proposed time frame.
 Would like your opinion.

Sounds good to me. I will read the paper now, and come back here if I
have any questions.

Jörn
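
As an illustration of the Naive Bayes step described above (a sketch under
the assumption that per-sense word counts have been collected from
sense-annotated contexts; names are hypothetical, not the actual patch):

++
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch: score a candidate sense with Naive Bayes over the context
// words, using log probabilities and add-one smoothing.
public class NaiveBayesSenseScorerSketch {

  // counts.get(sense).get(word) = occurrences of 'word' in training
  // contexts annotated with 'sense'
  private final Map<String, Map<String, Integer>> counts = new HashMap<>();
  private final Map<String, Integer> senseTotals = new HashMap<>();
  private final int vocabularySize;

  public NaiveBayesSenseScorerSketch(int vocabularySize) {
    this.vocabularySize = vocabularySize;
  }

  public void addObservation(String sense, String word) {
    counts.computeIfAbsent(sense, s -> new HashMap<>())
        .merge(word, 1, Integer::sum);
    senseTotals.merge(sense, 1, Integer::sum);
  }

  // log P(context | sense), up to the sense prior; the sense with the
  // highest score wins.
  public double score(String sense, String[] contextWords) {
    Map<String, Integer> senseCounts =
        counts.getOrDefault(sense, Collections.emptyMap());
    int total = senseTotals.getOrDefault(sense, 0);
    double logProb = 0.0d;
    for (String word : contextWords) {
      int c = senseCounts.getOrDefault(word, 0);
      // add-one smoothing so unseen words do not zero out the score
      logProb += Math.log((c + 1.0d) / (total + vocabularySize));
    }
    return logProb;
  }
}
++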




Re: GSoC 2015 - WSD Module

2015-06-25 Thread Joern Kottmann
On Wed, 2015-06-10 at 22:13 +0900, Anthony Beylerian wrote:
 Hi,
 
 I attached an initial patch to OPENNLP-758.
 However, we are currently modifying things a bit since many approaches need 
 to be supported, but would like your recommendations.
 Here are some notes : 
 
 1- We used extJWNL
 2- [WSDisambiguator] is the main interface
 3- [Loader] loads the resources required
 4- Please check [FeaturesExtractor] for the methods mentioned by Rodrigo.
 5- [Lesk] has many variants; we already implemented some, but we are wondering 
 about the preferred way to switch from one to the other:
 as of now we use one of them as the default, but we thought of either making 
 a parameter list to fill or making separate classes for each, or otherwise 
 following your preference.
 6- The other classes are for convenience.
 
 We will try to patch frequently on the separate issues, following the 
 feedback.


Sounds good, I reviewed it and think what we have is quite ok.

Most important now is to fix the smaller issues (see the jira issue) and
explain to us how it can be run.

The midterm evaluation is coming up next week as well.

How are we standing with the milestone we set?

Jörn





Re: GSoC 2015 - WSD Module

2015-06-25 Thread Joern Kottmann
On Mon, 2015-06-22 at 00:55 +0900, Anthony Beylerian wrote:
 Dear Jörn,
 Thank you for that.
 
 After further surveying, I was thinking of beginning the implementation of an 
 approach based on context clustering as a next step.
 Maybe similar to the one in [1], which relies on a public (CC-A licensed) 
 dataset [2]. Since clustering is usually done using K-means, which could take 
 some time with large data, this was already done previously and the results 
 were made publicly available in [3], with up to 20 closest clusters per 
 phrase.
 The authors in [1] propose to subsequently apply a Naive Bayes classifier as 
 described in their paper. I believe this is straightforward enough to 
 implement as another unsupervised approach within the proposed time frame.
 Would like your opinion.

Your users can just download the dataset and do the clustering themselves.
It should be possible to do that anyway. All the code necessary to
do that should be available as part of your contribution.

Jörn
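
For completeness, clustering context vectors with plain K-means could look
roughly like the sketch below (illustrative only; for large data one would
rather reuse the precomputed clusters from [3], as discussed above):

++
import java.util.Random;

// Minimal K-means sketch: assign each context vector to one of k clusters.
public class KMeansSketch {

  public static int[] cluster(double[][] points, int k, int iterations) {
    int dim = points[0].length;
    double[][] centroids = new double[k][];
    Random rnd = new Random(42);
    for (int i = 0; i < k; i++) {
      centroids[i] = points[rnd.nextInt(points.length)].clone();
    }
    int[] assignment = new int[points.length];
    for (int iter = 0; iter < iterations; iter++) {
      // Assignment step: attach each point to its nearest centroid.
      for (int p = 0; p < points.length; p++) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          double dist = 0;
          for (int d = 0; d < dim; d++) {
            double diff = points[p][d] - centroids[c][d];
            dist += diff * diff;
          }
          if (dist < bestDist) { bestDist = dist; best = c; }
        }
        assignment[p] = best;
      }
      // Update step: move each centroid to the mean of its points.
      double[][] sums = new double[k][dim];
      int[] sizes = new int[k];
      for (int p = 0; p < points.length; p++) {
        sizes[assignment[p]]++;
        for (int d = 0; d < dim; d++) sums[assignment[p]][d] += points[p][d];
      }
      for (int c = 0; c < k; c++) {
        if (sizes[c] > 0) {
          for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / sizes[c];
        }
      }
    }
    return assignment;
  }
}
++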




Re: GSoC 2015 - WSD Module

2015-06-19 Thread Rodrigo Agerri
Thanks for the update and the updated patch.

With respect to the licensing of BabelNet, I do not think we can
redistribute CC BY-NC-SA resources here, but others in this project
and Apache in general will probably know better than me.

Best,

Rodrigo

On Sun, Jun 14, 2015 at 2:47 PM, Anthony Beylerian
anthonybeyler...@hotmail.com wrote:
 Hi,
 Concerning this point, I would like to ask about BabelNet [1]. The advantage 
 of [1] is that it integrates WordNet, Wikipedia, Wiktionary, OmegaWiki, 
 Wikidata, and Open Multi-WordNet.
 Also, the newest SemEval task (whose results are just out [2]) relies on it.

 However, the 2.5.1 version, which can be used locally, follows a CC BY-NC-SA 
 3.0 license [3]. I read in [4] that CC-A (Attribution) licenses are 
 acceptable, however I am not completely sure if the NC-SA 
 (Non-commercial/ShareAlike) terms would be prohibitive, since it was mentioned 
 that :
 Many of these licenses have specific attribution terms that need to be 
 adhered to, for example CC-A, often by adding them to the NOTICE file. Ensure 
 you are doing this when including these works. Note, this list is 
 colloquially known as the Category A list.
 Would like your thoughts on the matter.
 Thanks !
 Anthony
 [1] : http://babelnet.org/download
 [2] : http://alt.qcri.org/semeval2015/cdrom/pdf/SemEval049.pdf
 [3] : https://creativecommons.org/licenses/by-nc-sa/3.0/
 [4] : http://www.apache.org/legal/resolved.html#category-a

 Date: Fri, 5 Jun 2015 15:09:24 +0200
 Subject: Re: GSoC 2015 - WSD Module
 From: kottm...@gmail.com
 To: dev@opennlp.apache.org

 Hello,

 yes, wordnet is fine, we already depend on it. I just think that remote
  resources are particularly problematic.

 For local resources it boils down to their license.

 Here is the wordnet one:
 http://wordnet.princeton.edu/wordnet/license/

 We might even be able to redistribute this here at Apache, which is really
 nice. To do that we have to check
 with the legal list if they give a green light for it.

 You can get more information about licenses and dependencies for Apache
 projects here:
 http://www.apache.org/legal/resolved.html#category-a
 http://www.apache.org/legal/resolved.html#category-b
 http://www.apache.org/legal/resolved.html#category-x

  Are the things you have to clean up of such a nature that you couldn't do them
  after you send in a patch?
 This could be removal of code which can be released under ASL.

 We would like to get you integrated into the way we work here as quickly as
 possible.

 That includes:
 - Tasks are planned/tracked via jira (this allows other people to
 comment/follow)
 - We would like to be able to review your code and maybe give some advice
 (commit often, break things down in tasks)
  - Changes or new features are usually discussed on the dev list (e.g. a
  short write-up about the approaches you implemented
    or, better, plan to implement)

 Jörn




Re: GSoC 2015 - WSD Module

2015-06-19 Thread Joern Kottmann
Hello,

I will dedicate time tonight to get this pulled in the sandbox and will
then also provide some feedback.
We can then create new patches against the sandbox to fix further issues.

Jörn

On Fri, Jun 19, 2015 at 11:02 AM, Anthony Beylerian 
anthonybeyler...@hotmail.com wrote:

 Thank you for the reply, I am guessing for now we will use the other
 sources.

  By the way, I have uploaded a newer patch on the same issue [1].
  Would like to know if the approach to set parameters is acceptable.

  Also, we are referencing some model files locally, like the tokenizer,
  tagger, etc., because we need them for the preprocessing chain. For example :

  ++
  private static String modelsDir =
  "src\\test\\resources\\opennlp\\tools\\disambiguator\\";

  TokenizerModel tokenizerModel = new TokenizerModel(
  new FileInputStream(modelsDir + "en-token.bin"));
  tokenizer = new TokenizerME(tokenizerModel);
  ++

  Thought of adding these files (.bin) in the test folder, but could anyone
  recommend a more elegant way to do this ?
 Thanks !

 Anthony

 [1] : https://issues.apache.org/jira/browse/OPENNLP-758
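
One conventional answer to the question above is to keep the .bin files under
src/test/resources and load them as classpath resources instead of via a
hard-coded, OS-specific path. A sketch (hypothetical class name; assumes the
models sit in the opennlp/tools/disambiguator package on the test classpath):

++
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class ModelLoadingSketch {

  // Load the tokenizer model from the test classpath rather than from a
  // file path with platform-specific separators.
  public static TokenizerME loadTokenizer() throws IOException {
    try (InputStream in = ModelLoadingSketch.class
        .getResourceAsStream("/opennlp/tools/disambiguator/en-token.bin")) {
      return new TokenizerME(new TokenizerModel(in));
    }
  }
}
++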


  From: rage...@apache.org
  Date: Fri, 19 Jun 2015 10:18:12 +0200
  Subject: Re: GSoC 2015 - WSD Module
  To: dev@opennlp.apache.org
 
  Thanks for the update and the updated patch.
 
  With respect to the licensing of BabelNet, I do not think we can
  redistribute CC BY-NC-SA resources here, but others in this project
  and Apache in general will probably know better than me.
 
  Best,
 
  Rodrigo
 
  On Sun, Jun 14, 2015 at 2:47 PM, Anthony Beylerian
  anthonybeyler...@hotmail.com wrote:
   Hi,
    Concerning this point, I would like to ask about BabelNet [1]. The
  advantage of [1] is that it integrates WordNet, Wikipedia, Wiktionary,
  OmegaWiki, Wikidata, and Open Multi-WordNet.
    Also, the newest SemEval task (whose results are just out [2]) relies
  on it.

    However, the 2.5.1 version, which can be used locally, follows a CC
  BY-NC-SA 3.0 license [3]. I read in [4] that CC-A (Attribution) licenses are
  acceptable, however I am not completely sure if the NC-SA
  (Non-commercial/ShareAlike) terms would be prohibitive, since it was
  mentioned that :
    Many of these licenses have specific attribution terms that need to
  be adhered to, for example CC-A, often by adding them to the NOTICE file.
  Ensure you are doing this when including these works. Note, this list is
  colloquially known as the Category A list.
    Would like your thoughts on the matter.
    Thanks !
    Anthony
    [1] : http://babelnet.org/download
    [2] : http://alt.qcri.org/semeval2015/cdrom/pdf/SemEval049.pdf
    [3] : https://creativecommons.org/licenses/by-nc-sa/3.0/
    [4] : http://www.apache.org/legal/resolved.html#category-a
  
   Date: Fri, 5 Jun 2015 15:09:24 +0200
   Subject: Re: GSoC 2015 - WSD Module
   From: kottm...@gmail.com
   To: dev@opennlp.apache.org
  
   Hello,
  
   yes, wordnet is fine, we already depend on it. I just think that
 remote
    resources are particularly problematic.
  
   For local resources it boils down to their license.
  
   Here is the wordnet one:
   http://wordnet.princeton.edu/wordnet/license/
  
   We might even be able to redistribute this here at Apache, which is
 really
   nice. To do that we have to check
   with the legal list if they give a green light for it.
  
   You can get more information about licenses and dependencies for
 Apache
   projects here:
   http://www.apache.org/legal/resolved.html#category-a
   http://www.apache.org/legal/resolved.html#category-b
   http://www.apache.org/legal/resolved.html#category-x
  
    Are the things you have to clean up of such a nature that you couldn't
  do them
    after you send in a patch?
   This could be removal of code which can be released under ASL.
  
   We would like to get you integrated into the way we work here as
 quickly as
   possible.
  
   That includes:
   - Tasks are planned/tracked via jira (this allows other people to
   comment/follow)
   - We would like to be able to review your code and maybe give some
 advice
   (commit often, break things down in tasks)
    - Changes or new features are usually discussed on the dev list
  (e.g. a
    short write-up about the approaches you implemented
  or, better, plan to implement)
  
   Jörn
  
  




RE: GSoC 2015 - WSD Module

2015-06-14 Thread Anthony Beylerian
Hi,
Concerning this point, I would like to ask about BabelNet [1]. The advantage of 
[1] is that it integrates WordNet, Wikipedia, Wiktionary, OmegaWiki, Wikidata, 
and Open Multi-WordNet.
Also, the newest SemEval task (whose results are just out [2]) relies on it.

However, the 2.5.1 version, which can be used locally, follows a CC BY-NC-SA 
3.0 license [3]. I read in [4] that CC-A (Attribution) licenses are acceptable, 
however I am not completely sure if the NC-SA (Non-commercial/ShareAlike) terms 
would be prohibitive, since it was mentioned that : 
Many of these licenses have specific attribution terms that need to be adhered 
to, for example CC-A, often by adding them to the NOTICE file. Ensure you are 
doing this when including these works. Note, this list is colloquially known as 
the Category A list.
Would like your thoughts on the matter.
Thanks !
Anthony
[1] : http://babelnet.org/download
[2] : http://alt.qcri.org/semeval2015/cdrom/pdf/SemEval049.pdf
[3] : https://creativecommons.org/licenses/by-nc-sa/3.0/
[4] : http://www.apache.org/legal/resolved.html#category-a

 Date: Fri, 5 Jun 2015 15:09:24 +0200
 Subject: Re: GSoC 2015 - WSD Module
 From: kottm...@gmail.com
 To: dev@opennlp.apache.org
 
 Hello,
 
 yes, wordnet is fine, we already depend on it. I just think that remote
 resources are particularly problematic.
 
 For local resources it boils down to their license.
 
 Here is the wordnet one:
 http://wordnet.princeton.edu/wordnet/license/
 
 We might even be able to redistribute this here at Apache, which is really
 nice. To do that we have to check
 with the legal list if they give a green light for it.
 
 You can get more information about licenses and dependencies for Apache
 projects here:
 http://www.apache.org/legal/resolved.html#category-a
 http://www.apache.org/legal/resolved.html#category-b
 http://www.apache.org/legal/resolved.html#category-x
 
 Are the things you have to clean up of such a nature that you couldn't do them
 after you send in a patch?
 This could be removal of code which can be released under ASL.
 
 We would like to get you integrated into the way we work here as quickly as
 possible.
 
 That includes:
 - Tasks are planned/tracked via jira (this allows other people to
 comment/follow)
 - We would like to be able to review your code and maybe give some advice
 (commit often, break things down in tasks)
 - Changes or new features are usually discussed on the dev list (e.g. a
 short write-up about the approaches you implemented
   or, better, plan to implement)
 
 Jörn

  

RE: GSoC 2015 - WSD Module

2015-06-10 Thread Anthony Beylerian
Hi,

I attached an initial patch to OPENNLP-758.
However, we are currently modifying things a bit since many approaches need to 
be supported, but would like your recommendations.
Here are some notes : 

1- We used extJWNL
2- [WSDisambiguator] is the main interface
3- [Loader] loads the resources required
4- Please check [FeaturesExtractor] for the methods mentioned by Rodrigo.
5- [Lesk] has many variants; we already implemented some, but we are wondering 
about the preferred way to switch from one to the other (see the sketch after 
this message):
as of now we use one of them as the default, but we thought of either making 
a parameter list to fill or making separate classes for each, or otherwise 
following your preference.
6- The other classes are for convenience.

We will try to patch frequently on the separate issues, following the feedback.

Best regards,

Anthony
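
On note 5, the parameter-list option could be sketched as below (hypothetical
names and variants; the actual classes in the patch may differ). A single
[Lesk] class could then switch on the selected variant internally, instead of
one class per variant:

++
// Hypothetical parameter object for selecting a Lesk variant.
public class LeskParameters {

  // Illustrative variant names only.
  public enum Variant { ORIGINAL, SIMPLIFIED, EXTENDED }

  private Variant variant = Variant.SIMPLIFIED; // the default variant
  private int contextWindowSize = 4;            // tokens on each side

  public Variant getVariant() { return variant; }
  public void setVariant(Variant variant) { this.variant = variant; }

  public int getContextWindowSize() { return contextWindowSize; }
  public void setContextWindowSize(int size) { contextWindowSize = size; }
}
++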

 Date: Wed, 10 Jun 2015 11:42:56 +0200
 Subject: Re: GSoC 2015 - WSD Module
 From: kottm...@gmail.com
 To: dev@opennlp.apache.org
 
 You can attach the patch to one of the issues, or you can create a new issue.
 In the end it doesn't matter much, but important is that we make progress
 here and get the initial code into our repository. Subsequent changes can
 then be done in a patch series.
 
 Please try to submit the patch as quickly as possible.
 
 Jörn
 
 On Mon, Jun 8, 2015 at 4:54 PM, Rodrigo Agerri rage...@apache.org wrote:
 
  Hello,
 
  On Mon, Jun 8, 2015 at 3:49 PM, Mondher Bouazizi
  mondher.bouaz...@gmail.com wrote:
   Dear Rodrigo,
  
   As Anthony mentioned in his previous email, I already started the
   implementation of the IMS approach. The pre-processing and the extraction
   of features have already been finished. Regarding the approach itself, it
   shows some potential according to the authors, though the proposed features
   are few and basic.
 
  Hi, yes, the features are not that complex, but it is good to have a
  working system and then if needed the feature set can be
  improved/enriched. As stated in the paper, the IMS approach leverages
  parallel data to obtain state of the art results in both lexical
  sample and all words for senseval 3 and semeval 2007 datasets.
 
  I think it will be nice to have a working system with this algorithm
  as part of the WSD component in OpenNLP (following the API discussion
  previous in this thread) and perform some evaluations to know where
  the system is with respect to state of the art results in those
  datasets. Once this is operative, I think it will be a good moment to
  start discussing additional/better features.
 
   I think the approach itself might be
   enhanced if we add more context-specific features from some other
   approaches... (To do that, I need to run many experiments using different
   combinations of features, however, that should not be a problem).
 
  Speaking about the feature sets, in the API google doc I have not seen
  anything about the implementation of the feature extractors, could you
  perhaps provide some extra info (in that same document, for example)
  about that?
 
   But the approach itself requires a linear SVM classifier, and as far as I
   know, OpenNLP has only a Maximum Entropy classifier. Is it OK to use
  libsvm
   ?
 
  I think you can try with a MaxEnt to start with and in the meantime,
  @Jörn has commented sometimes that there is a plugin component in
  OpenNLP to use third-party ML libraries and that he tested it with
   Mallet. Perhaps he could comment on using that functionality to plug in
   SVMs.
 
  
   Regarding the training data, I started collecting some from different
   sources. Most of the existing rich corpora are licensed (Including the
  ones
   mentioned in the paper). The free ones I got for now are from the
  Senseval
   and Semeval websites. However, these are used just to evaluate the
  proposed
   methods in the workshops. Therefore, the words to disambiguate are few in
   number though the training data for each word are rich enough.
  
    In any case, the first tests with the collected Senseval and Semeval data
    should be finished soon. However, I am not sure if there is a rich enough
    dataset we can use to build our model for the WSD module in the OpenNLP
    library.
   If you have any recommendation, I would be grateful if you can help me on
   this point.
 
  Well, as I said in my previous email, research around word senses is
  moving from WSD towards Supersense tagging where there are recent
  papers and freely available tweet datasets, for example. In any case,
  we can look more into it but in the meantime the Semcor for training
  and senseval/semeval2007 datasets for evaluation should be enough to
  compare your system with the literature.
 
  
   As Jörn mentioned sending an initial patch, should we separate our codes
   and upload two different patches to the two issues we created on the Jira
   (however, this means a lot of redundancy in the code), or shall we keep
   them in one project and upload it? If we opt for the latter case

Re: GSoC 2015 - WSD Module

2015-06-10 Thread Joern Kottmann
You can attach the patch to one of the issues, or you can create a new issue.
In the end it doesn't matter much, but important is that we make progress
here and get the initial code into our repository. Subsequent changes can
then be done in a patch series.

Please try to submit the patch as quickly as possible.

Jörn

On Mon, Jun 8, 2015 at 4:54 PM, Rodrigo Agerri rage...@apache.org wrote:

 Hello,

 On Mon, Jun 8, 2015 at 3:49 PM, Mondher Bouazizi
 mondher.bouaz...@gmail.com wrote:
  Dear Rodrigo,
 
  As Anthony mentioned in his previous email, I already started the
  implementation of the IMS approach. The pre-processing and the extraction
  of features have already been finished. Regarding the approach itself, it
  shows some potential according to the authors, though the proposed features
  are few and basic.

 Hi, yes, the features are not that complex, but it is good to have a
 working system and then if needed the feature set can be
 improved/enriched. As stated in the paper, the IMS approach leverages
 parallel data to obtain state of the art results in both lexical
 sample and all words for senseval 3 and semeval 2007 datasets.

 I think it will be nice to have a working system with this algorithm
 as part of the WSD component in OpenNLP (following the API discussion
 previous in this thread) and perform some evaluations to know where
 the system is with respect to state of the art results in those
 datasets. Once this is operative, I think it will be a good moment to
 start discussing additional/better features.

  I think the approach itself might be
  enhanced if we add more context-specific features from some other
  approaches... (To do that, I need to run many experiments using different
  combinations of features, however, that should not be a problem).

 Speaking about the feature sets, in the API google doc I have not seen
 anything about the implementation of the feature extractors, could you
 perhaps provide some extra info (in that same document, for example)
 about that?

  But the approach itself requires a linear SVM classifier, and as far as I
  know, OpenNLP has only a Maximum Entropy classifier. Is it OK to use
 libsvm
  ?

 I think you can try with a MaxEnt to start with and in the meantime,
 @Jörn has commented sometimes that there is a plugin component in
 OpenNLP to use third-party ML libraries and that he tested it with
  Mallet. Perhaps he could comment on using that functionality to plug in
  SVMs.

 
  Regarding the training data, I started collecting some from different
  sources. Most of the existing rich corpora are licensed (Including the
 ones
  mentioned in the paper). The free ones I got for now are from the
 Senseval
  and Semeval websites. However, these are used just to evaluate the
 proposed
  methods in the workshops. Therefore, the words to disambiguate are few in
  number though the training data for each word are rich enough.
 
  In any case, the first tests with the collected Senseval and Semeval data
  should be finished soon. However, I am not sure if there is a rich enough
  dataset we can use to build our model for the WSD module in the OpenNLP
  library.
  If you have any recommendation, I would be grateful if you can help me on
  this point.

 Well, as I said in my previous email, research around word senses is
 moving from WSD towards Supersense tagging where there are recent
 papers and freely available tweet datasets, for example. In any case,
 we can look more into it but in the meantime the Semcor for training
 and senseval/semeval2007 datasets for evaluation should be enough to
 compare your system with the literature.

 
  As Jörn mentioned sending an initial patch, should we separate our codes
  and upload two different patches to the two issues we created on the Jira
  (however, this means a lot of redundancy in the code), or shall we keep
  them in one project and upload it? If we opt for the latter case, which
  issue should we upload the patch to ?

 In my opinion, it should be the same patch and same Component with
 different algorithm implementations within it. Any other opinions?

 Cheers,

 Rodrigo



Re: GSoC 2015 - WSD Module

2015-06-08 Thread Rodrigo Agerri
Hello,

+1 for using extJWNL instead of JWNL, I use it in some other projects
too and it is very nice IMHO.

R

On Sat, Jun 6, 2015 at 12:55 PM, Aliaksandr Autayeu
aliaksa...@autayeu.com wrote:
 Thinking of impartiality... Anyway, I'm the author of extJWNL in case you
 have questions.

 Aliaksandr

 On 6 June 2015 at 11:43, Richard Eckart de Castilho 
 richard.eck...@gmail.com wrote:

 On 05.06.2015, at 14:24, Anthony Beylerian anthonybeyler...@hotmail.com
 wrote:

  So just to make sure, we are currently relying on JWNL to access WordNet
 as a resource.

 There is a more modern fork of JWNL available called
 http://extjwnl.sourceforge.net .
 It includes provisions for loading WordNet from the classpath, e.g.
 from Maven dependencies. It might be a nice replacement for JWNL and is
 also licensed
 under the BSD license. Pre-packaged WordNet Maven artifacts are also
 available.

 Cheers,

 -- Richard


Re: GSoC 2015 - WSD Module

2015-06-08 Thread Mondher Bouazizi
Dear Rodrigo,

As Anthony mentioned in his previous email, I already started the
implementation of the IMS approach. The pre-processing and the extraction
of features have already been finished. Regarding the approach itself, it
shows some potential according to the authors, though the proposed features
are few and basic. I think the approach itself might be
enhanced if we add more context-specific features from some other
approaches... (To do that, I need to run many experiments using different
combinations of features, however, that should not be a problem).
But the approach itself requires a linear SVM classifier, and as far as I
know, OpenNLP has only a Maximum Entropy classifier. Is it OK to use libsvm
?
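
For orientation, the IMS feature types mentioned above (surrounding words,
their POS tags, and local collocations) could be extracted roughly as in the
sketch below (illustrative only; the actual FeaturesExtractor in the patch
may differ):

++
import java.util.ArrayList;
import java.util.List;

// Rough sketch of IMS-style features for the target token at 'idx'.
public class IMSFeatureSketch {

  public static List<String> extract(String[] tokens, String[] posTags,
      int idx) {
    List<String> features = new ArrayList<>();
    int window = 3; // window size here is illustrative
    for (int i = Math.max(0, idx - window);
        i <= Math.min(tokens.length - 1, idx + window); i++) {
      if (i == idx) continue;
      features.add("word=" + tokens[i].toLowerCase());      // surrounding word
      features.add("pos[" + (i - idx) + "]=" + posTags[i]); // its POS tag
    }
    // One local collocation: the ordered word pair around the target.
    if (idx > 0 && idx < tokens.length - 1) {
      features.add("colloc=" + tokens[idx - 1] + "_" + tokens[idx + 1]);
    }
    return features;
  }
}
++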

Regarding the training data, I started collecting some from different
sources. Most of the existing rich corpora are licensed (including the ones
mentioned in the paper). The free ones I got for now are from the Senseval
and Semeval websites. However, these are used just to evaluate the proposed
methods in the workshops. Therefore, the words to disambiguate are few in
number though the training data for each word are rich enough.

In any case, the first tests with the collected Senseval and Semeval data should
be finished soon. However, I am not sure if there is a rich enough dataset we
can use to build our model for the WSD module in the OpenNLP library.
If you have any recommendation, I would be grateful if you can help me on
this point.

On the other hand, we're cleaning our implementation of the different
variations of Lesk. However, we are currently using JWNL. If there are no
objections, we will migrate to extJWNL.

As Jörn mentioned sending an initial patch, should we separate our codes
and upload two different patches to the two issues we created on the Jira
(however, this means a lot of redundancy in the code), or shall we keep
them in one project and upload it? If we opt for the latter case, which
issue should we upload the patch to ?

Thanks,

Mondher, Anthony

On Mon, Jun 8, 2015 at 7:51 PM, Rodrigo Agerri rage...@apache.org wrote:

 Hello,

 +1 for using extJWNL instead of JWNL, I use it in some other projects
 too and it is very nice IMHO.

 R

 On Sat, Jun 6, 2015 at 12:55 PM, Aliaksandr Autayeu
 aliaksa...@autayeu.com wrote:
  Thinking of impartiality... Anyway, I'm the author of extJWNL in case you
  have questions.
 
  Aliaksandr
 
  On 6 June 2015 at 11:43, Richard Eckart de Castilho 
  richard.eck...@gmail.com wrote:
 
  On 05.06.2015, at 14:24, Anthony Beylerian 
 anthonybeyler...@hotmail.com
  wrote:
 
   So just to make sure, we are currently relying on JWNL to access
 WordNet
  as a resource.
 
  There is a more modern fork of JWNL available called
  http://extjwnl.sourceforge.net .
   It includes provisions for loading WordNet from the classpath, e.g.
  from Maven dependencies. It might be a nice replacement for JWNL and is
  also licensed
  under the BSD license. Pre-packaged WordNet Maven artifacts are also
  available.
 
  Cheers,
 
  -- Richard



Re: GSoC 2015 - WSD Module

2015-06-06 Thread Aliaksandr Autayeu
Thinking of impartiality... Anyway, I'm the author of extJWNL in case you
have questions.

Aliaksandr

On 6 June 2015 at 11:43, Richard Eckart de Castilho 
richard.eck...@gmail.com wrote:

 On 05.06.2015, at 14:24, Anthony Beylerian anthonybeyler...@hotmail.com
 wrote:

  So just to make sure, we are currently relying on JWNL to access WordNet
 as a resource.

 There is a more modern fork of JWNL available called
 http://extjwnl.sourceforge.net .
  It includes provisions for loading WordNet from the classpath, e.g.
 from Maven dependencies. It might be a nice replacement for JWNL and is
 also licensed
 under the BSD license. Pre-packaged WordNet Maven artifacts are also
 available.

 Cheers,

 -- Richard


Re: GSoC 2015 - WSD Module

2015-06-06 Thread Richard Eckart de Castilho
On 05.06.2015, at 14:24, Anthony Beylerian anthonybeyler...@hotmail.com wrote:

 So just to make sure, we are currently relying on JWNL to access WordNet as a 
 resource. 

There is a more modern fork of JWNL available called 
http://extjwnl.sourceforge.net .
It includes provisions for loading WordNet from the classpath, e.g.
from Maven dependencies. It might be a nice replacement for JWNL and is also 
licensed
under the BSD license. Pre-packaged WordNet Maven artifacts are also 
available.

Cheers,

-- Richard
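
To make the suggestion concrete, a minimal lookup through extJWNL might look
like the sketch below (assuming extJWNL and one of the pre-packaged WordNet
artifacts are on the classpath; consult the extJWNL documentation for the
exact entry points):

++
import net.sf.extjwnl.data.IndexWord;
import net.sf.extjwnl.data.POS;
import net.sf.extjwnl.data.Synset;
import net.sf.extjwnl.dictionary.Dictionary;

// Sketch: list the senses (synsets) of a lemma through extJWNL.
public class ExtJWNLSketch {

  public static void main(String[] args) throws Exception {
    // Loads WordNet data found on the classpath, e.g. from a
    // pre-packaged WordNet Maven artifact.
    Dictionary dict = Dictionary.getDefaultResourceInstance();
    IndexWord word = dict.getIndexWord(POS.NOUN, "bank");
    for (Synset sense : word.getSenses()) {
      System.out.println(sense.getGloss());
    }
  }
}
++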

Re: GSoC 2015 - WSD Module

2015-06-05 Thread Joern Kottmann
Hello,

yes, wordnet is fine, we already depend on it. I just think that remote
resources are particularly problematic.

For local resources it boils down to their license.

Here is the wordnet one:
http://wordnet.princeton.edu/wordnet/license/

We might even be able to redistribute this here at Apache, which is really
nice. To do that we have to check
with the legal list if they give a green light for it.

You can get more information about licenses and dependencies for Apache
projects here:
http://www.apache.org/legal/resolved.html#category-a
http://www.apache.org/legal/resolved.html#category-b
http://www.apache.org/legal/resolved.html#category-x

Are the things you have to clean up of such a nature that you couldn't do them
after you send in a patch?
This could be removal of code which can be released under ASL.

We would like to get you integrated into the way we work here as quickly as
possible.

That includes:
- Tasks are planned/tracked via jira (this allows other people to
comment/follow)
- We would like to be able to review your code and maybe give some advice
(commit often, break things down in tasks)
- Changes or new features are usually discussed on the dev list (e.g. a
short write-up about the approaches you implemented
  or, better, plan to implement)

Jörn




On Fri, Jun 5, 2015 at 2:24 PM, Anthony Beylerian 
anthonybeyler...@hotmail.com wrote:

 Hi,

 We understand the issues.

 So just to make sure, we are currently relying on JWNL to access WordNet
 as a resource. Is that fine for now ?

 In case we need to avoid such dependencies,  would it be ok to create a
 resource file that includes what we need extracted from it or also from
 other resources combined (sense inventory, word relationships and so on) ?
 We'd like your recommendation.

 Also we are currently cleaning up the project and will upload a patch.
 To sum up, we have already implemented the Lesk approach, as well as parts
 of the supervised IMS approach (preprocessing, feature extraction).
 Next, we will implement the baseline techniques and collect the training
 data that will be used by supervised approaches.
 Files will be collected from different sources and will be unified in a
 single model file.
 Best regards,

 Anthony, Mondher


  Date: Wed, 3 Jun 2015 16:47:50 +0200
  Subject: Re: GSoC 2015 - WSD Module
  From: kottm...@gmail.com
  To: dev@opennlp.apache.org
 
  We should not use remote resources. A remote service adds severe limits
 to
  the WSD component. A remote resource will be slow to query (compared to
  disk or memory), queries might be expensive (pay per request), the
 license
  might not allow usage in a way the ASL promises to our users. Another
 issue
  is that calling a remote service might leak the document text itself to
  that remote service.
 
  Please attach a patch to the jira issue, and then we can pull it into the
  sandbox.
 
  Jörn
 
 
 
 
 
  On Wed, Jun 3, 2015 at 1:34 PM, Anthony Beylerian 
  anthonybeyler...@hotmail.com wrote:
 
   Dear Jörn,
  
    Thank you for the reply.
    ===
    Yes, in the draft WSDisambiguator is the main interface.
   ===
   Yes for the disambiguate method the input is expected to be tokenized,
 it
   should be an input array.
   The second argument is for the token index.  We can also make it into
 an
   index array to support multiple words.
   ===
   Concerning the resources, we expect two types of resources : local and
   remote resources.
  
   + For local resources, we have two main types :
   1- training models for supervised techniques.
   2- knowledge resources
  
   It could be best to make the packaging using similar OpenNLP models
 for #1.
   As for #2, it will depend on what we want to use,  since the type of
   information depends on the specific technique.
  
    + As for remote resources, e.g. [BabelNet], [WordsAPI], etc., we might need
    to have some REST support, for example to retrieve a sense inventory for a
    certain word. Actually, the newest semeval task [Semeval15] will use
    [BabelNet] for WSD and EL (Entity Linking). [BabelNet] has an offline
    version, but the newest one is only available through REST. Also, if a
    remote resource is needed and it requires a license, we would need to use
    a license key or just use the free quota with no key.
  
   Therefore, we thought of having a [ResourceProvider] as mentioned in
 the
   [draft].
   Are there any plans to add an external API connector of the sort or is
   this functionality already possible for extension ?
    (I noticed there is a [wikinews_importer] in the sandbox)
  
   But in any case we can always start working only locally as a first
 step,
   what do you think ?
   ===
   It would be more straightforward to use the algorithm names, so ok why
 not.
   ===
   Yes we have already started working !
   What do we

RE: GSoC 2015 - WSD Module

2015-06-03 Thread Anthony Beylerian
Dear Jörn,

Thank you for the reply.
===
Yes, in the draft WSDisambiguator is the main interface.
===
Yes, for the disambiguate method the input is expected to be tokenized; it 
should be an input array.
The second argument is for the token index. We can also make it into an index 
array to support multiple words.
===
Concerning the resources, we expect two types of resources : local and remote 
resources.

+ For local resources, we have two main types :
1- training models for supervised techniques.
2- knowledge resources 

It could be best to make the packaging using similar OpenNLP models for #1.
As for #2, it will depend on what we want to use,  since the type of 
information depends on the specific technique.

+ As for remote resources, e.g. [BabelNet], [WordsAPI], etc., we might need to 
have some REST support, for example to retrieve a sense inventory for a certain 
word. Actually, the newest semeval task [Semeval15] will use [BabelNet] for WSD 
and EL (Entity Linking). [BabelNet] has an offline version, but the newest one 
is only available through REST. Also, if a remote resource is needed and it 
requires a license, we would need to use a license key or just use the free 
quota with no key.

Therefore, we thought of having a [ResourceProvider] as mentioned in the 
[draft]. 
Are there any plans to add an external API connector of the sort or is this 
functionality already possible for extension ?
(I noticed there is a [wikinews_importer] in the sandbox)

But in any case we can always start working only locally as a first step, what 
do you think ?
===
It would be more straightforward to use the algorithm names, so ok why not.
===
Yes we have already started working !
What do we need to push to the sandbox ?
===

Thanks !

Anthony 

[BabelNet] : http://babelnet.org/download
[WordsAPI] : https://www.wordsapi.com/
[Semeval15] : http://alt.qcri.org/semeval2015/task13/
[draft] : 
https://docs.google.com/document/d/10FfAoavKQfQBAWF-frpfltcIPQg6GFrsoD1XmTuGsHM/edit?pli=1


 Subject: Re: GSoC 2015 - WSD Module
 From: kottm...@gmail.com
 To: dev@opennlp.apache.org
 Date: Mon, 1 Jun 2015 20:30:08 +0200
 
 Hello,
 
 I had a look at your APIs.
 
  Let's start with the WSDisambiguator. Should that be an interface?
 
 // returns the senses ordered by their score (best one first or only 1
 in supervised case)
  String[] disambiguate(String inputText, int inputWordposition);
 
 Shouldn't we have a tokenized input? Or is the inputText a token?
 
 If you have resources you could package those into OpenNLP models and
 use the existing serialization support. Would that work for you?
 
 I think we should have different implementing classes for different
 algorithms rather than grouping that in the Supervised and Unsupervised
 classes. And also use the algorithm / approach name as part of the class
 name.
 
  As far as I understand you already started to work on this. Should we do an
  initial code drop into the sandbox, and then work things out from there?
  We strongly prefer to have as much of the source code editing history as
  possible in our version control system.
 
 Jörn 
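
A minimal sketch of the interface shape converged on above (tokenized input
plus a target token index, or an index array for multi-word targets; names
are illustrative, not the final API):

++
// Illustrative interface sketch, not the final API.
public interface WSDisambiguatorSketch {

  // Returns sense IDs ordered by score, best first (or only one
  // in the supervised case).
  String[] disambiguate(String[] tokenizedContext, int tokenIndex);

  // Variant covering multi-word targets via an array of token indices.
  String[] disambiguate(String[] tokenizedContext, int[] tokenIndices);
}
++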
  

Re: GSoC 2015 - WSD Module

2015-06-03 Thread Joern Kottmann
We should not use remote resources. A remote service adds severe limits to
the WSD component. A remote resource will be slow to query (compared to
disk or memory), queries might be expensive (pay per request), the license
might not allow usage in a way the ASL promises to our users. Another issue
is that calling a remote service might leak the document text itself to
that remote service.

Please attach a patch to the jira issue, and then we can pull it into the
sandbox.

Jörn





On Wed, Jun 3, 2015 at 1:34 PM, Anthony Beylerian 
anthonybeyler...@hotmail.com wrote:

 Dear Jörn,

  Thank you for the reply.
  ===
  Yes, in the draft WSDisambiguator is the main interface.
 ===
  Yes, for the disambiguate method the input is expected to be tokenized; it
  should be an input array.
  The second argument is for the token index. We can also make it into an
  index array to support multiple words.
 ===
 Concerning the resources, we expect two types of resources : local and
 remote resources.

 + For local resources, we have two main types :
 1- training models for supervised techniques.
 2- knowledge resources

 It could be best to make the packaging using similar OpenNLP models for #1.
 As for #2, it will depend on what we want to use,  since the type of
 information depends on the specific technique.

  + As for remote resources, e.g. [BabelNet], [WordsAPI], etc., we might need
  to have some REST support, for example to retrieve a sense inventory for a
  certain word. Actually, the newest semeval task [Semeval15] will use
  [BabelNet] for WSD and EL (Entity Linking). [BabelNet] has an offline
  version, but the newest one is only available through REST. Also, if a
  remote resource is needed and it requires a license, we would need to use a
  license key or just use the free quota with no key.

 Therefore, we thought of having a [ResourceProvider] as mentioned in the
 [draft].
 Are there any plans to add an external API connector of the sort or is
 this functionality already possible for extension ?
  (I noticed there is a [wikinews_importer] in the sandbox)

 But in any case we can always start working only locally as a first step,
 what do you think ?
 ===
 It would be more straightforward to use the algorithm names, so ok why not.
 ===
 Yes we have already started working !
 What do we need to push to the sandbox ?
 ===

 Thanks !

 Anthony

 [BabelNet] : http://babelnet.org/download
 [WordsAPI] : https://www.wordsapi.com/
 [Semeval15] : http://alt.qcri.org/semeval2015/task13/
 [draft] :
 https://docs.google.com/document/d/10FfAoavKQfQBAWF-frpfltcIPQg6GFrsoD1XmTuGsHM/edit?pli=1


  Subject: Re: GSoC 2015 - WSD Module
  From: kottm...@gmail.com
  To: dev@opennlp.apache.org
  Date: Mon, 1 Jun 2015 20:30:08 +0200
 
  Hello,
 
  I had a look at your APIs.
 
   Let's start with the WSDisambiguator. Should that be an interface?
 
  // returns the senses ordered by their score (best one first or only 1
  in supervised case)
   String[] disambiguate(String inputText, int inputWordposition);
 
  Shouldn't we have a tokenized input? Or is the inputText a token?
 
  If you have resources you could package those into OpenNLP models and
  use the existing serialization support. Would that work for you?
 
  I think we should have different implementing classes for different
  algorithms rather than grouping that in the Supervised and Unsupervised
  classes. And also use the algorithm / approach name as part of the class
  name.
 
   As far as I understand you already started to work on this. Should we do an
   initial code drop into the sandbox, and then work things out from there?
   We strongly prefer to have as much of the source code editing history as
   possible in our version control system.
 
  Jörn




Re: GSoC 2015 - WSD Module

2015-06-01 Thread Joern Kottmann
Hello,

I had a look at your APIs.

Let's start with the WSDisambiguator. Should that be an interface?

// returns the senses ordered by their score (best one first or only 1
in supervised case)
String[] disambiguate(String inputText, int inputWordposition);

Shouldn't we have a tokenized input? Or is the inputText a token?

If you have resources you could package those into OpenNLP models and
use the existing serialization support. Would that work for you?

I think we should have different implementing classes for different
algorithms rather than grouping that in the Supervised and Unsupervised
classes. And also use the algorithm / approach name as part of the class
name.

As far as I understand you already started to work on this. Should we do an
initial code drop into the sandbox, and then work things out from there?
We strongly prefer to have as much of the source code editing history as
possible in our version control system.

Jörn 

On Sat, 2015-05-23 at 01:44 +0900, Anthony Beylerian wrote:
 Hello,
 
 Thank you for the feedback.
 
 Please use this link to access a quick draft of the interface :
 https://docs.google.com/document/d/10FfAoavKQfQBAWF-frpfltcIPQg6GFrsoD1XmTuGsHM/edit?pli=1
 
 I believe the previously mentioned link was not allowing for document updates.
 
 As for the common interface, since supervised methods rely on classifiers 
 they will need to load/save the training models, so we will need to separate 
 the two, maybe as in the draft.
 However we could keep a parent class with a common [disambiguate] method that 
 can be used for evaluation tasks and others.
 
 Thanks !
 
 Anthony
 
 
 
  Date: Fri, 22 May 2015 09:18:39 +0200
  Subject: Re: GSoC 2015 - WSD Module
  From: kottm...@gmail.com
  To: dev@opennlp.apache.org
  
  Hello,
  
  one of the tasks we should start is, is to define the interface for the WSD
  component.
  
  Please have a look at the other components in OpenNLP and try to propose an
  interface in a similar style.
  Can we use one interface for all the different implementations?
  
  Jörn
  
  
  On Mon, May 18, 2015 at 3:27 PM, Mondher Bouazizi 
  mondher.bouaz...@gmail.com wrote:
  
   Dear all,
  
   Sorry if you received multiple copies of this email (The links were
   embedded). Here are the actual links:
  
   *Figure:*
  
   https://drive.google.com/file/d/0B7ON7bq1zRm3Sm1YYktJTVctLWs/view?usp=sharing
   *Semeval/senseval results summary:*
  
   https://docs.google.com/spreadsheets/d/1NCiwXBQs0rxUwtZ3tiwx9FZ4WELIfNCkMKp8rlnKObY/edit?usp=sharing
   *Literature survey of WSD techniques:*
  
   https://docs.google.com/spreadsheets/d/1WQbJNeaKjoT48iS_7oR8ifZlrd4CfhU1Tay_LLPtlCM/edit?usp=sharing
  
   Yours faithfully
  
   On Mon, May 18, 2015 at 10:17 PM, Anthony Beylerian 
   anthonybeyler...@hotmail.com wrote:
  
Please excuse the duplicate email, we could not attach the mentioned
figure.
Kindly find it here.
Thank you.
   
From: anthonybeyler...@hotmail.com
To: dev@opennlp.apache.org
Subject: GSoC 2015 - WSD Module
Date: Mon, 18 May 2015 22:14:43 +0900
   
   
   
   
Dear all,
In the context of building a Word Sense Disambiguation (WSD) module,
   after
doing a survey on WSD techniques, we realized the following points :
 - WSD techniques can be split into three sets (supervised,
 unsupervised/knowledge based, hybrid)
 - WSD is used for different directly related objectives such as all-words
 disambiguation, lexical sample disambiguation, multi/cross-lingual
 approaches etc.
 - Senseval/Semeval seem to be good references to compare different
 techniques for WSD since many of them were tested on the same data (but
 different data each event).
 - For the sake of making a first solution, we propose to start with
 supporting the lexical sample type of disambiguation, meaning to
 disambiguate single/limited word(s) from an input text.
 Therefore, we have decided to collect information about the different
 techniques in the literature (such as references, performance, parameters
 etc.) in this spreadsheet here. Otherwise we have also collected the results
 of all the senseval/semeval exercises here. (Note that each document has
 many sheets.) The collected results could help decide on which techniques to
 start with as main models for each set of techniques
 (supervised/unsupervised).
 We also propose a general approach for the package in the figure attached.
 The main components are as follows :
 1- The different resources publicly available : WordNet, BabelNet,
 Wikipedia, etc. However, we would also like to allow the users to use their
 own local resources, by maybe defining a type of connector to the resource
 interface.
 2- The resource interface will have the role to provide both a sense
 inventory that the user can query and a knowledge base (such as semantic or
 syntactic info. etc.) that might be used depending on the technique.

Re: GSoC 2015 - WSD Module

2015-05-22 Thread Joern Kottmann
Hello,

one of the tasks we should start is, is to define the interface for the WSD
component.

Please have a look at the other components in OpenNLP and try to propose an
interface in a similar style.
Can we use one interface for all the different implementations?

Jörn


On Mon, May 18, 2015 at 3:27 PM, Mondher Bouazizi 
mondher.bouaz...@gmail.com wrote:

 Dear all,

 Sorry if you received multiple copies of this email (The links were
 embedded). Here are the actual links:

 *Figure:*

 https://drive.google.com/file/d/0B7ON7bq1zRm3Sm1YYktJTVctLWs/view?usp=sharing
 *Semeval/senseval results summary:*

 https://docs.google.com/spreadsheets/d/1NCiwXBQs0rxUwtZ3tiwx9FZ4WELIfNCkMKp8rlnKObY/edit?usp=sharing
 *Literature survey of WSD techniques:*

 https://docs.google.com/spreadsheets/d/1WQbJNeaKjoT48iS_7oR8ifZlrd4CfhU1Tay_LLPtlCM/edit?usp=sharing

 Yours faithfully

 On Mon, May 18, 2015 at 10:17 PM, Anthony Beylerian 
 anthonybeyler...@hotmail.com wrote:

  Please excuse the duplicate email, we could not attach the mentioned
  figure.
  Kindly find it here.
  Thank you.
 
  From: anthonybeyler...@hotmail.com
  To: dev@opennlp.apache.org
  Subject: GSoC 2015 - WSD Module
  Date: Mon, 18 May 2015 22:14:43 +0900
 
 
 
 
  Dear all,
  In the context of building a Word Sense Disambiguation (WSD) module,
 after
  doing a survey on WSD techniques, we realized the following points :
  - WSD techniques can be split into three sets (supervised,
  unsupervised/knowledge based, hybrid)
  - WSD is used for different directly related objectives such as all-words
  disambiguation, lexical sample disambiguation, multi/cross-lingual
  approaches etc.
  - Senseval/Semeval seem to be good references to compare different
  techniques for WSD since many of them were tested on the same data (but
  different data each event).
  - For the sake of making a first solution, we propose to start with
  supporting the lexical sample type of disambiguation, meaning to
  disambiguate single/limited word(s) from an input text.
  Therefore, we have decided to collect information about the different
  techniques in the literature (such as references, performance, parameters
  etc.) in this spreadsheet here. Otherwise we have also collected the results
  of all the senseval/semeval exercises here. (Note that each document has
  many sheets.) The collected results could help decide on which techniques to
  start with as main models for each set of techniques
  (supervised/unsupervised).
  We also propose a general approach for the package in the figure attached.
  The main components are as follows :
  1- The different resources publicly available : WordNet, BabelNet,
  Wikipedia, etc. However, we would also like to allow the users to use their
  own local resources, by maybe defining a type of connector to the resource
  interface.
  2- The resource interface will have the role to provide both a sense
  inventory that the user can query and a knowledge base (such as semantic or
  syntactic info. etc.) that might be used depending on the technique. We
  might even later consider building a local cache for remote services.
  3- The WSD algorithms/techniques themselves that will make use of the
  resource interface to access the resources required. These techniques will
  be split into two main packages as in the left side of the figure :
  Supervised/Unsupervised. The utils package includes common tools used in
  both types of techniques. The details mentioned in each package should be
  common to all implementations of these abstract models.
  4- I/O could be processed in different formats (XML/JSON etc.) or a simpler
  structure following your recommendations.
  If you have any suggestions or recommendations, we would really
 appreciate
  discussing them and would like your guidance to iterate on this tool-set.
  Best regards,
 
  Anthony Beylerian, Mondher Bouazizi
 



Re: GSoC 2015 - WSD Module

2015-05-22 Thread Rodrigo Agerri
Hello Mondher (my response is about supervised WSD),

Thanks for the info, it is quite interesting. Apart from the comment
by Jörn, which I think is very important if we want to achieve
something given the time constraints of the GSoC, I have a couple of
recommendations/comments on my part:

1. Rather than targeting the Lexical Sample task or all-words WSD, I think
it could be more practical to choose an approach/algorithm and try to
implement it in OpenNLP. One of the most (if not the most) popular
approaches is the It Makes Sense (IMS) system

http://www.comp.nus.edu.sg/~nlp/sw/README.txt
https://www.comp.nus.edu.sg/~nght/pubs/ims.pdf

That I think is achievable in the GSoC time frame.

2. As an aside, research has been moving towards supersense tagging
(SST), given the difficulty of WSD.

http://ttic.uchicago.edu/~altun/pubs/CiaAlt_EMNLP06.pdf

As you can see in the above paper, SST is approached as a sequence
labelling task, rather than classification. This means that we could
reimplement Ciaramita and Altun (2006) features by implementing the
AdaptiveFeatureGenerators and creating a module structurally similar
to the NameFinder but for SST.

This has also the advantage of being able to move to datasets that are
not old Semcor and senseval and using current Tweet datasets and so
on. See this recent paper on SST on tweets:

http://aclweb.org/anthology/S14-1001

I think that for supervised WSD, we should pursue option 1. or 2. and
start defining the interface as Jörn has suggested.

Best,

Rodrigo

On Mon, May 18, 2015 at 2:14 PM, Anthony Beylerian
anthonybeyler...@hotmail.com wrote:
 Dear all,

 In the context of building a Word Sense Disambiguation (WSD) module, after
 doing a survey on WSD techniques, we realized the following points :

 - WSD techniques can be split into three sets (supervised,
 unsupervised/knowledge based, hybrid)

 - WSD is used for different directly related objectives such as all-words
 disambiguation, lexical sample disambiguation, multi/cross-lingual
 approaches etc.

 - Senseval/Semeval seem to be good references to compare different
 techniques for WSD since many of them were tested on the same data (but
 different data each event).

 - For the sake of making a first solution, we propose to start with
 supporting the lexical sample type of disambiguation, meaning to
 disambiguate single/limited word(s) from an input text.


 Therefore, we have decided to collect information about the different
 techniques in the literature (such as references, performance, parameters
 etc.) in this spreadsheet here.
 Otherwise we have also collected the results of all the senseval/semeval
 exercises here.
 (Note that each document has many sheets)
 The collected results, could help decide on which techniques to start with
 as main models for each set of techniques (supervised/unsupervised).

 We also propose a general approach for the package in the figure attached.
 The main components are as follows :

 1- The different resources publicly available : WordNet, BabelNet,
 Wikipedia, etc.
 However, we would also like to allow the users to use their own local
 resources, by maybe defining a type of connector to the resource interface.

 2- The resource interface will have the role to provide both a sense
 inventory that the user can query and a knowledge base (such as semantic or
 syntactic info. etc.) that might be used depending on the technique.
 We might even later consider building a local cache for remote services.

 3- The WSD algorithms/techniques themselves that will make use of the
 resource interface to access the resources required.
 These techniques will be split into two main packages as in the left side of
 the figure :  Supervised/Unsupervised.
 The utils package includes common tools used in both types of techniques.
 The details mentioned in each package should be common to all
 implementations of these abstract models.

 4- I/O could be processed in different formats (XML/JSON etc) or a simpler
 structure following your recommendations.

 If you have any suggestions or recommendations, we would really appreciate
 discussing them and would like your guidance to iterate on this tool-set.

 Best regards,

 Anthony Beylerian, Mondher Bouazizi


Re: GSoC 2015 - WSD Module

2015-05-22 Thread Mondher Bouazizi
Hi all,

Thanks Rodrigo for the feedback.
I don't mind starting with an IMS implementation as a first supervised
solution.
It seems to be a good first step.
As for the SST, I will read more about it and will let you know.

On the other hand, how about the following interface that Anthony and I
prepared based on Jörn's recommendation?
We tried to stay as close as possible to the other tools already implemented.

Link :
https://drive.google.com/file/d/0B7ON7bq1zRm3NTI1bGFfc3lZX0U/view?usp=sharing
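
For readers without access to the linked document, a hypothetical shape
such an interface might take, with purely illustrative names that do not
reflect the actual proposal, could be:

    // Illustrative only; not the interface in the linked document.
    public interface WordSenseDisambiguator {

        // Disambiguates the token at 'index' in a tokenized sentence,
        // returning a sense identifier from the underlying sense inventory.
        String disambiguate(String[] tokens, int index);

        // Lexical-sample variant: disambiguate a span of selected tokens.
        String[] disambiguate(String[] tokens, int fromIndex, int toIndex);
    }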

Best regards,

Mondher, Anthony



On Fri, May 22, 2015 at 9:59 PM, Rodrigo Agerri rage...@apache.org wrote:

 Hello Mondher (my response is about supervised WSD),

 Thanks for the info; it is quite interesting. Apart from the comment
 by Jörn, which I think is very important if we want to achieve
 something given the time constraints of GSoC, I have a couple of
 recommendations/comments for my part:

 1. Rather than targeting the lexical sample task or all-words WSD, I think
 it could be more practical to choose an approach/algorithm and try to
 implement it in OpenNLP. One of the most (if not the most) popular
 approaches is the It Makes Sense (IMS) system

 http://www.comp.nus.edu.sg/~nlp/sw/README.txt
 https://www.comp.nus.edu.sg/~nght/pubs/ims.pdf

 That, I think, is achievable within the GSoC time frame.

 2. As an aside, research has been moving towards supersense tagging
 (SST), given the difficulty of WSD.

 http://ttic.uchicago.edu/~altun/pubs/CiaAlt_EMNLP06.pdf

 As you can see in the above paper, SST is approached as a sequence
 labelling task rather than classification. This means that we could
 reimplement the Ciaramita and Altun (2006) features by implementing
 AdaptiveFeatureGenerators and creating a module structurally similar
 to the NameFinder, but for SST.

 This also has the advantage of letting us move beyond the old SemCor
 and Senseval datasets to current tweet datasets and so on. See this
 recent paper on SST on tweets:

 http://aclweb.org/anthology/S14-1001

 I think that for supervised WSD we should pursue option 1 or 2, and
 start defining the interface as Jörn has suggested.

 Best,

 Rodrigo

 On Mon, May 18, 2015 at 2:14 PM, Anthony Beylerian
 anthonybeyler...@hotmail.com wrote:
  Dear all,
 
  In the context of building a Word Sense Disambiguation (WSD) module,
  after surveying WSD techniques, we noted the following points:

  - WSD techniques can be split into three families (supervised,
  unsupervised/knowledge-based, hybrid)

  - WSD is applied to several closely related tasks, such as all-words
  disambiguation, lexical sample disambiguation, multi-/cross-lingual
  approaches, etc.

  - Senseval/Semeval seem to be good references for comparing WSD
  techniques, since many of them were evaluated on the same data (though
  the data differs from one event to the next).

  - As a first solution, we propose to start by supporting lexical sample
  disambiguation, i.e. disambiguating a single word (or a limited set of
  words) in an input text.


  Therefore, we have decided to collect information about the different
  techniques in the literature (such as references, performance,
  parameters, etc.) in this spreadsheet here.
  We have also collected the results of all the Senseval/Semeval exercises
  here. (Note that each document has many sheets.)
  The collected results could help us decide which techniques to start
  with as the main models for each family (supervised/unsupervised).

  We also propose a general approach for the package in the attached
  figure.
  The main components are as follows:

  1- The different resources publicly available: WordNet, BabelNet,
  Wikipedia, etc.
  However, we would also like to allow users to use their own local
  resources, perhaps by defining a connector to the resource interface.

  2- The resource interface will provide both a sense inventory that the
  user can query and a knowledge base (semantic or syntactic information,
  etc.) that may be used depending on the technique.
  We might even later consider building a local cache for remote services.

  3- The WSD algorithms/techniques themselves, which will use the resource
  interface to access the required resources.
  These techniques will be split into two main packages, as on the left
  side of the figure: Supervised/Unsupervised.
  The utils package includes common tools used by both types of
  techniques.
  The details mentioned in each package should be common to all
  implementations of these abstract models.

  4- I/O could be processed in different formats (XML/JSON, etc.) or in a
  simpler structure, following your recommendations.

  If you have any suggestions or recommendations, we would really
  appreciate discussing them and would like your guidance to iterate on
  this tool-set.
 
  Best regards,
 
  Anthony Beylerian, Mondher Bouazizi



Re: GSoC 2015 - WSD Module

2015-05-18 Thread Mondher Bouazizi
Dear all,

Sorry if you received multiple copies of this email (the links were
embedded). Here are the actual links:

*Figure:*
https://drive.google.com/file/d/0B7ON7bq1zRm3Sm1YYktJTVctLWs/view?usp=sharing
*Semeval/senseval results summary:*
https://docs.google.com/spreadsheets/d/1NCiwXBQs0rxUwtZ3tiwx9FZ4WELIfNCkMKp8rlnKObY/edit?usp=sharing
*Literature survey of WSD techniques:*
https://docs.google.com/spreadsheets/d/1WQbJNeaKjoT48iS_7oR8ifZlrd4CfhU1Tay_LLPtlCM/edit?usp=sharing

Yours faithfully,

On Mon, May 18, 2015 at 10:17 PM, Anthony Beylerian 
anthonybeyler...@hotmail.com wrote:

 Please excuse the duplicate email; we could not attach the mentioned
 figure.
 Kindly find it here.
 Thank you.

 From: anthonybeyler...@hotmail.com
 To: dev@opennlp.apache.org
 Subject: GSoC 2015 - WSD Module
 Date: Mon, 18 May 2015 22:14:43 +0900




 Dear all,
 In the context of building a Word Sense Disambiguation (WSD) module, after
 surveying WSD techniques, we noted the following points:

 - WSD techniques can be split into three families (supervised,
 unsupervised/knowledge-based, hybrid)

 - WSD is applied to several closely related tasks, such as all-words
 disambiguation, lexical sample disambiguation, multi-/cross-lingual
 approaches, etc.

 - Senseval/Semeval seem to be good references for comparing WSD techniques,
 since many of them were evaluated on the same data (though the data differs
 from one event to the next).

 - As a first solution, we propose to start by supporting lexical sample
 disambiguation, i.e. disambiguating a single word (or a limited set of
 words) in an input text.

 Therefore, we have decided to collect information about the different
 techniques in the literature (such as references, performance, parameters,
 etc.) in this spreadsheet here.
 We have also collected the results of all the Senseval/Semeval exercises
 here. (Note that each document has many sheets.)
 The collected results could help us decide which techniques to start with
 as the main models for each family (supervised/unsupervised).

 We also propose a general approach for the package in the attached figure.
 The main components are as follows:

 1- The different resources publicly available: WordNet, BabelNet,
 Wikipedia, etc.
 However, we would also like to allow users to use their own local
 resources, perhaps by defining a connector to the resource interface.

 2- The resource interface will provide both a sense inventory that the user
 can query and a knowledge base (semantic or syntactic information, etc.)
 that may be used depending on the technique.
 We might even later consider building a local cache for remote services.

 3- The WSD algorithms/techniques themselves, which will use the resource
 interface to access the required resources.
 These techniques will be split into two main packages, as on the left side
 of the figure: Supervised/Unsupervised.
 The utils package includes common tools used by both types of techniques.
 The details mentioned in each package should be common to all
 implementations of these abstract models.

 4- I/O could be processed in different formats (XML/JSON, etc.) or in a
 simpler structure, following your recommendations.

 If you have any suggestions or recommendations, we would really appreciate
 discussing them and would like your guidance to iterate on this tool-set.
 Best regards,

 Anthony Beylerian, Mondher Bouazizi



RE: GSoC 2015 - WSD Module

2015-05-18 Thread Anthony Beylerian
Please excuse the duplicate email; we could not attach the mentioned figure.
Kindly find it here.
Thank you.

From: anthonybeyler...@hotmail.com
To: dev@opennlp.apache.org
Subject: GSoC 2015 - WSD Module
Date: Mon, 18 May 2015 22:14:43 +0900




Dear all,
In the context of building a Word Sense Disambiguation (WSD) module, after
surveying WSD techniques, we noted the following points:

- WSD techniques can be split into three families (supervised,
unsupervised/knowledge-based, hybrid)

- WSD is applied to several closely related tasks, such as all-words
disambiguation, lexical sample disambiguation, multi-/cross-lingual
approaches, etc.

- Senseval/Semeval seem to be good references for comparing WSD techniques,
since many of them were evaluated on the same data (though the data differs
from one event to the next).

- As a first solution, we propose to start by supporting lexical sample
disambiguation, i.e. disambiguating a single word (or a limited set of
words) in an input text.

Therefore, we have decided to collect information about the different
techniques in the literature (such as references, performance, parameters,
etc.) in this spreadsheet here.
We have also collected the results of all the Senseval/Semeval exercises
here. (Note that each document has many sheets.)
The collected results could help us decide which techniques to start with
as the main models for each family (supervised/unsupervised).

We also propose a general approach for the package in the attached figure.
The main components are as follows:

1- The different resources publicly available: WordNet, BabelNet,
Wikipedia, etc.
However, we would also like to allow users to use their own local
resources, perhaps by defining a connector to the resource interface.

2- The resource interface will provide both a sense inventory that the user
can query and a knowledge base (semantic or syntactic information, etc.)
that may be used depending on the technique.
We might even later consider building a local cache for remote services.

3- The WSD algorithms/techniques themselves, which will use the resource
interface to access the required resources.
These techniques will be split into two main packages, as on the left side
of the figure: Supervised/Unsupervised.
The utils package includes common tools used by both types of techniques.
The details mentioned in each package should be common to all
implementations of these abstract models.

4- I/O could be processed in different formats (XML/JSON, etc.) or in a
simpler structure, following your recommendations.

If you have any suggestions or recommendations, we would really appreciate
discussing them and would like your guidance to iterate on this tool-set.
Best regards,

Anthony Beylerian, Mondher Bouazizi