Hi,
I attached an initial patch to OPENNLP-758.
However, we are still adjusting things, since several approaches need to be
supported, and we would like your recommendations.
Here are some notes:
1- We used extJWNL.
2- [WSDisambiguator] is the main interface.
3- [Loader] loads the required resources.
4- Please check [FeaturesExtractor] for the methods mentioned by Rodrigo.
5- [Lesk] has many variants; we have already implemented some of them, but we
are wondering about the preferred way to switch from one to another.
As of now we use one of them as the default, but we considered either exposing
a parameter list to fill in or making a separate class for each variant; we
will follow your preference.
6- The other classes are for convenience.
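Regarding point 5, here is a minimal sketch of the parameter-object option,
where a single Lesk class stays the entry point and the variant is selected
through a configuration object. All names (LeskSketch, LeskParameters,
LeskType) are illustrative placeholders, not the actual classes in the patch:

```java
// Hypothetical sketch: switching Lesk variants via a parameter object
// instead of one subclass per variant. Names are illustrative only.
public class LeskSketch {

  // The variant selection lives in configuration rather than in the type
  // hierarchy.
  enum LeskType { ORIGINAL, SIMPLIFIED, EXTENDED }

  static class LeskParameters {
    LeskType type = LeskType.ORIGINAL; // default variant
    int windowSize = 4;                // context window around the target

    LeskParameters type(LeskType t) { this.type = t; return this; }
    LeskParameters windowSize(int w) { this.windowSize = w; return this; }
  }

  static class Lesk {
    private final LeskParameters params;

    Lesk(LeskParameters params) { this.params = params; }

    // Stub: a real implementation would score candidate senses by gloss
    // overlap within the configured window.
    String disambiguate(String[] tokens, int targetIndex) {
      return tokens[targetIndex] + "#" + params.type + "#1";
    }
  }

  public static void main(String[] args) {
    Lesk lesk = new Lesk(new LeskParameters()
        .type(LeskType.SIMPLIFIED)
        .windowSize(6));
    String[] sentence = {"the", "bank", "approved", "the", "loan"};
    System.out.println(lesk.disambiguate(sentence, 1));
  }
}
```

The subclass-per-variant alternative would replace the enum switch with
polymorphism; the parameter object keeps the API surface smaller when the
variants share most of their options.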
We will try to submit patches frequently on the separate issues, following
your feedback.
Best regards,
Anthony
> Date: Wed, 10 Jun 2015 11:42:56 +0200
> Subject: Re: GSoC 2015 - WSD Module
> From: kottm...@gmail.com
> To: dev@opennlp.apache.org
>
> You can attach the patch to one of the issues, or you can create a new
> issue. In the end it doesn't matter much; what is important is that we make
> progress here and get the initial code into our repository. Subsequent
> changes can then be done in a patch series.
>
> Please try to submit the patch as quickly as possible.
>
> Jörn
>
> On Mon, Jun 8, 2015 at 4:54 PM, Rodrigo Agerri wrote:
>
> > Hello,
> >
> > On Mon, Jun 8, 2015 at 3:49 PM, Mondher Bouazizi
> > wrote:
> > > Dear Rodrigo,
> > >
> > > As Anthony mentioned in his previous email, I already started the
> > > implementation of the IMS approach. The pre-processing and the extraction
> > > of features have already been finished. Regarding the approach itself, it
> > > shows some potential according to the authors, though the proposed
> > > features are few and basic.
> >
> > Hi, yes, the features are not that complex, but it is good to have a
> > working system and then if needed the feature set can be
> > improved/enriched. As stated in the paper, the IMS approach leverages
> > parallel data to obtain state of the art results in both lexical
> > sample and all words for senseval 3 and semeval 2007 datasets.
> >
> > I think it will be nice to have a working system with this algorithm
> > as part of the WSD component in OpenNLP (following the API discussion
> > earlier in this thread) and perform some evaluations to know where
> > the system is with respect to state of the art results in those
> > datasets. Once this is operative, I think it will be a good moment to
> > start discussing additional/better features.
> >
> > > I think the approach itself might be
> > > enhanced if we add more context-specific features from some other
> > > approaches... (To do that, I need to run many experiments using
> > > different combinations of features; however, that should not be a
> > > problem).
> >
> > Speaking about the feature sets, in the API google doc I have not seen
> > anything about the implementation of the feature extractors, could you
> > perhaps provide some extra info (in that same document, for example)
> > about that?
> >
> > > But the approach itself requires a linear SVM classifier, and as far as I
> > > know, OpenNLP has only a Maximum Entropy classifier. Is it OK to use
> > > libsvm?
> >
> > I think you can try with MaxEnt to start with. In the meantime, @Jörn
> > has commented a few times that there is a plugin component in OpenNLP
> > for using third-party ML libraries and that he has tested it with
> > Mallet. Perhaps he could comment on using that functionality for SVMs.
> >
> > >
> > > Regarding the training data, I started collecting some from different
> > > sources. Most of the existing rich corpora are licensed (including the
> > > ones mentioned in the paper). The free ones I have for now are from the
> > > Senseval and Semeval websites. However, these are used just to evaluate
> > > the methods proposed in the workshops. Therefore, the words to
> > > disambiguate are few in number, though the training data for each word
> > > are rich enough.
> > >
> > > In any case, the first tests with the collected Senseval and Semeval
> > > data should be finished soon. However, I am not sure there is a rich
> > > enough dataset we can use to build our model for the WSD module in the
> > > OpenNLP library.
> > > If you have any recommendation, I would be grateful if you can help me
> > > on this point.
> >
> > Well, as I said in my previous email, research around "word senses" is
> > moving from WSD towards supersense tagging, where there are recent
> > papers and freely available tweet datasets, for example. In any case,
> > we can look more into it, but in the meantime SemCor for training and
> > the Senseval/SemEval-2007 datasets for evaluation should be enough to
> > compare your system with the literature.
> >
> > >
> > > As Jörn mentioned sending an initial patch, should we separate our code
> > > and upload two different patches to the two issues we created on th