RE: GSoC 2015 - WSD Module

2015-06-10 Thread Anthony Beylerian
Hi,

I attached an initial patch to OPENNLP-758.
However, we are still adjusting the design, since several approaches need to
be supported, and we would appreciate your recommendations.
Here are some notes:

1 - We use extJWNL.
2 - [WSDisambiguator] is the main interface.
3 - [Loader] loads the required resources.
4 - Please check [FeaturesExtractor] for the methods mentioned by Rodrigo.
5 - [Lesk] has many variants; we have already implemented some, but we are 
wondering about the preferred way to switch between them:
as of now we use one of them as the default, but we thought of either exposing 
a parameter list to fill in or making a separate class for each variant, or 
otherwise following your preference.
6 - The other classes are for convenience.
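As a rough illustration of the parameter-list option in point 5: apart from the [WSDisambiguator] and [Lesk] names above, everything here (the `LeskParameters` class, the `Variant` enum, the `disambiguate` signature) is hypothetical and not taken from the actual patch; it is only a sketch of how one class could cover several variants.

```java
// Hypothetical sketch: selecting a Lesk variant via a parameter object
// instead of a separate class per variant. Only the WSDisambiguator and
// Lesk names come from the patch; the rest is illustrative.
import java.util.List;

interface WSDisambiguator {
    // Return the chosen sense key for the word at tokenIndex in the sentence.
    String disambiguate(List<String> tokens, int tokenIndex);
}

class LeskParameters {
    enum Variant { SIMPLIFIED, ORIGINAL, EXTENDED }
    Variant variant = Variant.SIMPLIFIED; // default variant
    int windowSize = 4;                   // context window on each side
}

class Lesk implements WSDisambiguator {
    private final LeskParameters params;

    Lesk(LeskParameters params) { this.params = params; }

    @Override
    public String disambiguate(List<String> tokens, int tokenIndex) {
        // A real implementation would score WordNet glosses against the
        // context window; here we only show how the variant is dispatched.
        switch (params.variant) {
            case ORIGINAL:  return "original-lesk-sense";
            case EXTENDED:  return "extended-lesk-sense";
            default:        return "simplified-lesk-sense";
        }
    }
}
```

Callers would then switch variants with `new Lesk(params)` rather than loading a different class; separate classes per variant would work equally well but would duplicate the shared gloss-overlap code.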

We will try to submit patches frequently to the separate issues, following your feedback.

Best regards,

Anthony

> Date: Wed, 10 Jun 2015 11:42:56 +0200
> Subject: Re: GSoC 2015 - WSD Module
> From: kottm...@gmail.com
> To: dev@opennlp.apache.org
> 
> You can attach the patch to one of the issues, or you can create a new issue.
> In the end it doesn't matter much; what is important is that we make progress
> here and get the initial code into our repository. Subsequent changes can
> then be done in a patch series.
> 
> Please try to submit the patch as quickly as possible.
> 
> Jörn
> 

Re: GSoC 2015 - WSD Module

2015-06-10 Thread Joern Kottmann
You can attach the patch to one of the issues, or you can create a new issue.
In the end it doesn't matter much; what is important is that we make progress
here and get the initial code into our repository. Subsequent changes can
then be done in a patch series.

Please try to submit the patch as quickly as possible.

Jörn

On Mon, Jun 8, 2015 at 4:54 PM, Rodrigo Agerri  wrote:

> Hello,
>
> On Mon, Jun 8, 2015 at 3:49 PM, Mondher Bouazizi
>  wrote:
> > Dear Rodrigo,
> >
> > As Anthony mentioned in his previous email, I have already started the
> > implementation of the IMS approach. The pre-processing and the feature
> > extraction are already finished. Regarding the approach itself, according
> > to the author it shows some potential, although the proposed features are
> > not many, and are basic.
>
> Hi, yes, the features are not that complex, but it is good to have a
> working system first; if needed, the feature set can be
> improved/enriched later. As stated in the paper, the IMS approach leverages
> parallel data to obtain state-of-the-art results in both the lexical
> sample and all-words tasks on the Senseval-3 and SemEval-2007 datasets.
>
> I think it will be nice to have a working system with this algorithm
> as part of the WSD component in OpenNLP (following the API discussion
> earlier in this thread) and to perform some evaluations to see where
> the system stands with respect to state-of-the-art results on those
> datasets. Once this is operational, I think it will be a good moment to
> start discussing additional/better features.
>
> > I think the approach itself might be
> > enhanced if we add more context-specific features from some other
> > approaches... (To do that, I need to run many experiments using different
> > combinations of features; however, that should not be a problem.)
>
> Speaking of the feature sets, in the API Google doc I have not seen
> anything about the implementation of the feature extractors; could you
> perhaps provide some extra info (in that same document, for example)
> about that?
>
> > But the approach itself requires a linear SVM classifier, and as far as I
> > know, OpenNLP has only a Maximum Entropy classifier. Is it OK to use
> > libsvm?
>
> I think you can try with MaxEnt to start with. In the meantime,
> @Jörn has commented a few times that there is a plugin component in
> OpenNLP for using third-party ML libraries, and that he tested it with
> Mallet. Perhaps he could comment on using that functionality to plug
> in SVMs.
>
> >
> > Regarding the training data, I have started collecting some from different
> > sources. Most of the existing rich corpora are licensed (including the
> > ones mentioned in the paper). The free ones I have for now are from the
> > Senseval and SemEval websites. However, these are used only to evaluate
> > the methods proposed in the workshops. Therefore, the words to
> > disambiguate are few in number, though the training data for each word
> > are rich enough.
> >
> > In any case, the first tests with the collected Senseval and SemEval data
> > should be finished soon. However, I am not sure there is a rich enough
> > dataset we can use to build our model for the WSD module in the OpenNLP
> > library. If you have any recommendation, I would be grateful for your
> > help on this point.
>
> Well, as I said in my previous email, research around "word senses" is
> moving from WSD towards supersense tagging, where there are recent
> papers and freely available tweet datasets, for example. In any case,
> we can look more into it, but in the meantime SemCor for training and
> the Senseval/SemEval-2007 datasets for evaluation should be enough to
> compare your system with the literature.
>
> >
> > As Jörn mentioned sending an initial patch: should we separate our code
> > and upload two different patches to the two issues we created on Jira
> > (however, this means a lot of redundancy in the code), or shall we keep
> > it in one project and upload that? If we opt for the latter, which
> > issue should we upload the patch to?
>
> In my opinion, it should be the same patch and the same component, with
> the different algorithm implementations within it. Any other opinions?
>
> Cheers,
>
> Rodrigo
>