Fwd: Word sense disambiguation

2018-02-24 Thread Anthony Beylerian
Hey Cristian,

We have tried different approaches such as:

- Lesk (original) [1]
- Most frequent sense from the data (MFS)
- Extended Lesk (with different scoring functions)
- It Makes Sense (IMS) [2]
- A sense clustering approach (I don't immediately recall the reference)

Lesk and MFS are meant to be used as baselines for evaluation purposes only.
The extended version of Lesk improves on the original by incorporating
additional information from semantic relationships.
Although it's not very accurate, it could be useful since it is an
unsupervised method (no need for large training data).
However, there were some caveats, as both approaches need to pre-load
dictionaries as well as score a semantic graph from WordNet at runtime.
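To make the gloss-overlap idea concrete, here is a toy sketch of the
original Lesk scoring (my own simplification, not the sandbox code; the
sense keys are real WordNet-style keys, but the glosses are paraphrased for
illustration, and a real setup would pre-load them from WordNet):

import java.util.*;

public class SimplifiedLesk {

    // Picks the sense whose gloss overlaps most with the context words.
    static String disambiguate(Map<String, Set<String>> senseGlosses,
                               Set<String> contextWords) {
        String best = null;
        int bestOverlap = -1;
        for (Map.Entry<String, Set<String>> e : senseGlosses.entrySet()) {
            Set<String> overlap = new HashSet<>(e.getValue());
            overlap.retainAll(contextWords);          // gloss / context overlap
            if (overlap.size() > bestOverlap) {
                bestOverlap = overlap.size();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Paraphrased glosses, keyed by WordNet-style sense keys.
        Map<String, Set<String>> glosses = new HashMap<>();
        glosses.put("bank%1:14:00::", new HashSet<>(Arrays.asList(
                "financial", "institution", "deposits", "money", "lending")));
        glosses.put("bank%1:17:01::", new HashSet<>(Arrays.asList(
                "sloping", "land", "beside", "river", "water")));

        Set<String> context = new HashSet<>(Arrays.asList(
                "he", "deposited", "money", "at", "the", "bank"));
        System.out.println(disambiguate(glosses, context)); // bank%1:14:00::
    }
}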

IMS is a supervised method that we were hoping to use as the main approach,
since it scored around 80% accuracy on SemEval; however, that is only for
the coarse-grained case. In reality, words have various degrees of
polysemy, and when tested in the fine-grained case the results were much
lower.
We have also experimented with a simple clustering approach, but the
improvements were not considerable, as far as I remember.
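For context, IMS-style systems treat each ambiguous word as its own
classification problem, with features drawn from the local context
(surrounding words, their POS tags, and local collocations); a rough sketch
of that feature extraction, assuming nothing about the actual IMS code,
could look like:

import java.util.ArrayList;
import java.util.List;

public class ImsStyleFeatures {

    // Builds IMS-style features for the token at position target:
    // surrounding words, their POS tags, and a simple local collocation.
    static List<String> extractFeatures(String[] tokens, String[] tags,
                                        int target) {
        List<String> features = new ArrayList<>();
        int lo = Math.max(0, target - 3);
        int hi = Math.min(tokens.length - 1, target + 3);
        for (int i = lo; i <= hi; i++) {
            if (i == target) continue;
            features.add("word_" + (i - target) + "=" + tokens[i].toLowerCase());
            features.add("pos_" + (i - target) + "=" + tags[i]);
        }
        // Local collocation C(-1,1): the immediate neighbors as an ordered pair.
        if (target > 0 && target < tokens.length - 1) {
            features.add("coll_-1_1=" + tokens[target - 1].toLowerCase()
                    + "_" + tokens[target + 1].toLowerCase());
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"He", "sat", "on", "the", "bank", "of", "the", "river"};
        String[] tags   = {"PRP", "VBD", "IN", "DT", "NN", "IN", "DT", "NN"};
        System.out.println(extractFeatures(tokens, tags, 4));
    }
}

These features would then feed whichever classifier we pick (currently the
maxent one, as mentioned below in the older thread).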

I just checked the latest results on SemEval 2015 [3] and they look
somewhat improved for the fine-grained case (~65% F1).
However, in some particular domains the accuracy appears to increase, so
it could depend on the use case.

On the other hand, more recent studies might yield better results, but that
would need further investigation.

There are also some other issues, such as the lack of direct multi-lingual
support from WordNet, missing sense definitions, etc.
We were also still looking for a better source of sense definitions back
then.
In any case, I believe it would be better to reach higher performance
before putting this in the official distribution; however, that decision
highly depends on the team.
Beyond that, different parts of the code just need some simple refactoring.

Best,

Anthony

[1] : M. Lesk, Automatic sense disambiguation using machine readable
dictionaries
[2] : https://www.comp.nus.edu.sg/~nght/pubs/ims.pdf
[3] : http://alt.qcri.org/semeval2015/task13/index.php?id=results

On Wed, Feb 21, 2018 at 5:26 AM, Cristian Petroaca <
cristian.petro...@gmail.com> wrote:

> Hi Anthony,
>
> I'd be interested to discuss this further.
> What are the wsd methods used? Any links to papers?
> How does the module perform when being evaluated against Senseval?
>
> How much work do you think is necessary in order to have a functioning
> WSD module in the context of OpenNLP?
>
> Thanks,
> Cristian
>
>
>
> On Tue, Feb 20, 2018 at 8:09 AM, Anthony Beylerian <
> anthony.beyler...@gmail.com> wrote:
>
>> Hi Cristian,
>>
>> Thank you for your interest.
>>
>> The WSD module is currently experimental, so as far as I am aware there
>> is no timeline for it.
>>
>> You can find the sandboxed version here:
>> https://github.com/apache/opennlp-sandbox/tree/master/opennlp-wsd
>>
>> I personally haven't had the time to revisit this for a while and there
>> are still some details to work out.
>> But if you are really interested, you are welcome to discuss and
>> contribute.
>> I will assist as much as possible.
>>
>> Best,
>>
>> Anthony
>>
>> On Sun, Feb 18, 2018 at 5:52 AM, Cristian Petroaca <
>> cristian.petro...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm interested in word sense disambiguation (particularly based on
>>> Wordnet). I noticed that the latest OpenNLP version doesn't have any but
>>> I
>>> remember that a couple of years ago there was somebody working on
>>> implementing it. Why isn't it in the official OpenNLP jar? Is there a
>>> timeline for adding it?
>>>
>>> Thanks,
>>> Cristian
>>>
>>
>>
>


Re: Word sense disambiguation

2018-02-19 Thread Anthony Beylerian
Hi Cristian,

Thank you for your interest.

The WSD module is currently experimental, so as far as I am aware there is
no timeline for it.

You can find the sandboxed version here:
https://github.com/apache/opennlp-sandbox/tree/master/opennlp-wsd

I personally haven't had the time to revisit this for a while and there are
still some details to work out.
But if you are really interested, you are welcome to discuss and contribute.
I will assist as much as possible.

Best,

Anthony

On Sun, Feb 18, 2018 at 5:52 AM, Cristian Petroaca <
cristian.petro...@gmail.com> wrote:

> Hi,
>
> I'm interested in word sense disambiguation (particularly based on
> Wordnet). I noticed that the latest OpenNLP version doesn't have any but I
> remember that a couple of years ago there was somebody working on
> implementing it. Why isn't it in the official OpenNLP jar? Is there a
> timeline for adding it?
>
> Thanks,
> Cristian
>


Re: [VOTE] Migrate our main repositories to GitHub

2017-06-27 Thread Anthony Beylerian
+1

On Tue, Jun 27, 2017 at 10:45 PM, Dan Russ  wrote:

> +1
>
> > On Jun 27, 2017, at 9:28 AM, William Colen 
> wrote:
> >
> > +1
> >
> >
> > 2017-06-27 9:35 GMT-03:00 Suneel Marthi :
> >
> >> +1
> >>
> >> Sent from my iPhone
> >>
> >> On 27/06/2017 at 8:22 AM, Jeff Zemerick wrote:
> >>
> >>> +1
> >>>
> >>> On Tue, Jun 27, 2017 at 6:53 AM, Rodrigo Agerri  >
> >>> wrote:
> >>>
>  +1
> 
>  R
> 
> > On Tue, Jun 27, 2017 at 12:46 PM, Mark G 
> >> wrote:
> >
> > +1
> >
> > Sent from my iPhone
> >
> >> On Jun 27, 2017, at 6:30 AM, Joern Kottmann 
>  wrote:
> >>
> >> +1
> >>
> >> Jörn
> >>
> >>> On Tue, Jun 27, 2017 at 12:30 PM, Joern Kottmann <
> kottm...@gmail.com
> >>>
> > wrote:
> >>> Hello all,
> >>>
> >>> lets decide here if we want to move our main repository, currently
> >>> hosted at Apache to GitHub instead. This will make our process a
> bit
> >>> easier because we can eliminate one remote from our workflow.
> >>>
> >>>   [ ] +1 Migrate all repositories to GitHub
> >>>   [ ] -1 Do not migrate,  because...
> >>>
> >>> Thanks,
> >>> Jörn
> >
> 
> >>
>
>


Re: Fw: ApacheCon Europe 2016: Talk accepted!

2016-10-03 Thread Anthony Beylerian
Great!

Thank you for sharing.

For those who cannot attend, will slides/video be available?

Anthony

On Thu, Sep 29, 2016 at 12:14 AM, Tommaso Teofili  wrote:

> Very interesting!
> Thanks for letting us know.
>
> Tommaso
> On Wed, Sep 28, 2016 at 17:05, Boris Galitsky  >
> wrote:
>
> >
> > Hello
> >
> >  Just wanted to share that I will be talking about how to do deep text
> > analysis, including discourse trees, on top of OpenNLP
> >
> > Regards
> > Boris
> >
> > ---
> >
> >
> >
> >
> > Hi Boris Galitsky,
> > We are pleased to tell you that your talk, "A Deep Text Analysis System
> > Based on OpenNLP", has been accepted
> > for ApacheCon Europe 2016.
> >
> > Please confirm that you are still able/willing to speak at this event.
> >
> > With regards,
> > The team behind ApacheCon Europe 2016
> > rbo...@apache.org
> >
>


Re: Access to Git

2016-09-14 Thread Anthony Beylerian
Hello,

Concerning the workflow, how about using Gitflow? [1]

Advantages are:

- keeps a clean master branch, work is on the develop branch
- good for multiple (historical) versions
- good integration with sourcetree

Please consider.

Thanks,

Anthony

[1] :
https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow





On Thu, Sep 15, 2016 at 8:34 AM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Also please look at: http://wiki.apache.org/tika/UsingGit for a guide on
> how
> to migrate your OpenNLP SVN to Git..
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect, Instrument Software and Science Data Systems Section (398)
> Manager, Open Source Projects Formulation and Development Office (8212)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
>
>
>
> On 9/14/16, 4:12 PM, "Joern Kottmann"  wrote:
>
> Sorry, it took me a little while to figure this out.
>
> This link explains how it works:
> https://reference.apache.org/committer/git
>
> The repo name is opennlp; we will soon also have the other repos
> opennlp-addons and opennlp-sandbox.
>
> Jörn
>
> On Fri, Sep 9, 2016 at 10:58 PM, Joern Kottmann 
> wrote:
>
> > Hello, yes you can use it. The add-ons and other things are not set up
> > yet as far as I know; I have to ping the infra team about it.
> >
> > Please have a look at the issue I posted to see how to access it.
> >
> > I will work on this on Monday.
> >
> > HTH
> > Jörn
> >
> > On Sep 9, 2016 19:10, "William Colen" 
> wrote:
> >
> >> Hello,
> >>
> >> Is the Git repository ready for use?
> >> Do we need to wait for it to develop new stuff?
> >>
> >> Thank you,
> >> William
> >>
> >
>
>
>
>


Re: Migrate to Git?

2016-08-19 Thread Anthony Beylerian
@Jörn @Richard

I believe less bloat is always better for code housekeeping.
For example, although it is small, I think having the site code along with
the toolkit code just seems a bit untidy.

How about we at least separate those two?
It could also be useful to make a more feature rich site in the future.

Actually, the Spark team does that too:

git://git.apache.org/spark.git
git://git.apache.org/spark-website.git


@Madhawa

Did you mean to use branches for the sandboxed projects?

Best,

Anthony

On Fri, Aug 19, 2016 at 7:38 PM, Madhawa Kasun Gunasekara <
madhaw...@gmail.com> wrote:

> we can use branches instead of repositories.
>
> Thanks,
> Madhawa
>
> Madhawa
>
> On Fri, Aug 19, 2016 at 1:54 PM, Joern Kottmann 
> wrote:
>
> > Yes, it would be nice to get the next release out with sentiment
> analysis!
> > It is time for the next release anyway.
> >
> > Jörn
> >
> > On Thu, Aug 18, 2016 at 4:33 PM, Chris Mattmann 
> > wrote:
> >
> > > Fantastic, Joern! I have some SentimentAnalysis stuff to hopefully
> commit
> > > and
> > > get refactored. Hopefully after that’s done we can ship a release soon
> > and
> > > publish to Central.
> > >
> > >
> > >
> > > On 8/18/16, 5:50 AM, "Joern Kottmann"  wrote:
> > >
> > > We made some progress here, the repository is now switched to git.
> > >
> > > Please have a look here:
> > > https://issues.apache.org/jira/browse/INFRA-12209
> > >
> > > And there are couple of things we have to do now:
> > > https://issues.apache.org/jira/browse/OPENNLP-860
> > >
> > > The new repository currently only contains the trunk and not the
> > other
> > > stuff like addons, site and sandbox.
> > > I already commented on the infra issue, we might want to change the
> > > layout
> > > of our repository a bit.
> > > Any thoughts on it?
> > >
> > > The old layout is:
> > > addons
> > > trunk
> > > sandbox
> > > site
> > >
> > > BR,
> > > Jörn
> > >
> > > On Tue, Jul 5, 2016 at 3:11 AM, Mattmann, Chris A (3980) <
> > > chris.a.mattm...@jpl.nasa.gov> wrote:
> > >
> > > > Hi Jörn,
> > > >
> > > > #3 is a mirror on Github of our writeable Git repo from #1. Users
> > > > can submit PRs to it, and then it will flow through to dev list
> in
> > > > the form of an email that links to information that we can use to
> > > > easily merge into our writeable ASF repo. Once merged, it will
> sync
> > > > out to Github and close the PR.
> > > >
> > > > HTH!
> > > >
> > > > Cheers,
> > > > Chris
> > > >
> > > > ++
> > > > Chris Mattmann, Ph.D.
> > > > Chief Architect
> > > > Instrument Software and Science Data Systems Section (398)
> > > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > > Office: 168-519, Mailstop: 168-527
> > > > Email: chris.a.mattm...@nasa.gov
> > > > WWW:  http://sunset.usc.edu/~mattmann/
> > > > ++
> > > > Director, Information Retrieval and Data Science Group (IRDS)
> > > > Adjunct Associate Professor, Computer Science Department
> > > > University of Southern California, Los Angeles, CA 90089 USA
> > > > WWW: http://irds.usc.edu/
> > > > ++
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On 7/4/16, 1:23 PM, "Joern Kottmann"  wrote:
> > > >
> > > > >Can you explain 3, is that a writable mirror at Github?
> > > > >
> > > > >Jörn
> > > > >
> > > > >On Mon, 2016-07-04 at 15:35 +, Mattmann, Chris A (3980)
> wrote:
> > > > >> My +1 as well..I would suggest, specifically:
> > > > >>
> > > > >> 1. Use git-wp
> > > > >> 2. Borrow and adapt this guide which suggests how to do it
> > > > >> (i’m happy to adapt)
> > > > >> http://wiki.apache.org/tika/UsingGit
> > > > >> 3. Turn on writeable git wp mirror’ing to apache/opennlp
> > > > >>
> > > > >> Cheers,
> > > > >> Chris
> > > > >>
> > > > >> ++
> > > > >> Chris Mattmann, Ph.D.
> > > > >> Chief Architect
> > > > >> Instrument Software and Science Data Systems Section (398)
> > > > >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > > >> Office: 168-519, Mailstop: 168-527
> > > > >> Email: chris.a.mattm...@nasa.gov
> > > > >> WWW:  http://sunset.usc.edu/~mattmann/
> > > > >> ++
> > > > >> Director, Information Retrieval and Data Science Group (IRDS)
> > > > >> Adjunct Associate Professor, Computer Science Department
> > > > >> 

Re: Migrate to Git?

2016-08-19 Thread Anthony Beylerian
+1 for separate repositories.

Since they will be under the Apache Github Organization, it will also be
neater to browse them like this:

https://github.com/apache?query=opennlp

I recommend we keep the repository names starting with opennlp-

For example:

https://github.com/apache?query=hadoop

What do you think?

Best,

Anthony


On Fri, Aug 19, 2016 at 6:32 PM, Aliaksandr Autayeu 
wrote:

> >
> > Why do you think it is better?
> >
> In general, separating apples from oranges. In practice, not having to go
> through irrelevant stuff while reading, searching, refactoring. Less stuff
> to clone for build automation. Smaller repos to clone in general.
>
> And you can still do all the above by cloning 4 repos into the same
> directory and setting up a single project in your favorite IDE, emulating
> the current structure. But at least nothing forces you to do that, the
> way a single repo forces you to.
>
> However, the above might be subjective. In this case, it is for the
> commitocracy to decide.
>


Re: DeepLearning4J as a ML for OpenNLP

2016-07-02 Thread Anthony Beylerian
@William

I think what you meant previously by feature2vec would be to deep-learn
with any discrete state, not just with words, am I right?
Extra side-information could possibly help improve some results, but this
would make things overly complicated in my opinion.

@Boris,

Thank you very much, I see what you mean, yes they do complement each other
in that sense.

>> Do you have a particular problem in mind?

No particular problem, but it was very nice of you to clarify what you
meant by parse tree based approaches.
On the other hand, I am not currently aware of any studies comparing both
doc2vec and discourse trees for particular problems.
But it would be useful to know which to use in each case, since we are
considering deep learning support. The application areas you mentioned are
also quite interesting.

Otherwise, as you may already know there are a couple of projects currently
in progress for the toolkit: sentiment analysis and author profiling.
I think it would be good to use deep learning with these tools (as well as
others).

>> I can share code on git / papers on the above.

Yes I would love to check those out.

>> I am not sure it is a direction for openNLP?

I think Jörn answered that: it would be great to have the library offer
even more tools, and dl4j would be a nice addition (since it also offers
different neural net classifiers).

@Jörn

We could try looking into some existing publications about training models
(if any); if someone can point us in the right direction, that would
really help.

Otherwise, although we can use other classifiers, the dl4j team also has a
page of recommended neural nets to use for classification as the next step:
http://deeplearning4j.org/neuralnetworktable
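For a rough idea of the wiring, here is a minimal word2vec training sketch
with dl4j, adapted from their published example as I remember it (the
corpus file name is a placeholder, and the exact builder options should be
double-checked against the dl4j docs):

import java.util.Collection;

import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class Word2VecSketch {
    public static void main(String[] args) throws Exception {
        // One sentence per line; "corpus.txt" is a placeholder.
        SentenceIterator iter = new BasicLineIterator("corpus.txt");
        TokenizerFactory tokenizer = new DefaultTokenizerFactory();

        Word2Vec vec = new Word2Vec.Builder()
                .minWordFrequency(5)   // drop rare words
                .layerSize(100)        // embedding dimension
                .windowSize(5)
                .iterate(iter)
                .tokenizerFactory(tokenizer)
                .build();
        vec.fit();

        // Dense vectors like these would stand in for the sparse
        // feature vectors OpenNLP uses today.
        double[] day = vec.getWordVector("day");
        Collection<String> similar = vec.wordsNearest("day", 10);
        System.out.println(similar);
    }
}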

Best,

Anthony




On Fri, Jul 1, 2016 at 10:12 PM, Joern Kottmann  wrote:

> Hello,
>
> the people from deeplearning4j are rather nice and I discussed with them
> for a while how
> it can be used for OpenNLP. The state back then was that they don't
> properly support the
> sparse feature vectors we use in OpenNLP today. Instead we would need to
> use word embeddings.
> In the end I never tried it out but I think it might not be very difficult
> to get everything wired together;
> the most difficult part is probably to find a deep learning model setup
> which works well.
>
> Jörn
>
> On Tue, Jun 28, 2016 at 11:23 PM, William Colen 
> wrote:
>
> > Hi,
> >
> > Do you think it would be possible to implement a ML based on DL4J?
> >
> > http://deeplearning4j.org/
> >
> > Thank you
> > William
> >
>


Re: DeepLearning4J as a ML for OpenNLP

2016-06-29 Thread Anthony Beylerian
Hi Boris,

Thank you very much for sharing your experience with us!
Is it possible to ask you for more information?

I have only recently used dl4j with some introductory material; however, I
have also felt doc2vec could be quite useful, although my understanding of
it is still limited.

My current understanding is that doc2vec, as an extension of word2vec, can
capture a more generalized context (the document) instead of just focusing
on the context of a single word, in order to provide features useful for
classifying that document.

The advantage would be to better capture latent information that exists in
the document (such as the order of words), instead of just averaging word
vectors or using other document-level approaches (I would love some
feedback on this).

The generalizations could hurt the classification performance in some
tasks, but seem to be more useful when the target documents are larger.

It could also be possible to choose the "document" to be a single word,
reducing the underlying matrix to an array. Does that make sense?

Therefore, we could also use document-based vectors for mid- to high-level
tasks (doccat, sentiment, profiling, etc.). What do you think?
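To illustrate the doc2vec side, a minimal dl4j ParagraphVectors sketch,
again based on their examples as I remember them (the file name and builder
options are assumptions to verify), could look like:

import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.text.documentiterator.LabelsSource;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class Doc2VecSketch {
    public static void main(String[] args) throws Exception {
        // One document per line; each line gets an auto label DOC_<n>.
        SentenceIterator iter = new BasicLineIterator("documents.txt");
        TokenizerFactory tokenizer = new DefaultTokenizerFactory();
        LabelsSource labels = new LabelsSource("DOC_");

        ParagraphVectors vec = new ParagraphVectors.Builder()
                .minWordFrequency(1)
                .layerSize(100)         // document embedding dimension
                .windowSize(5)
                .labelsSource(labels)
                .iterate(iter)
                .tokenizerFactory(tokenizer)
                .trainWordVectors(true) // learn word vectors alongside
                .build();
        vec.fit();

        // Document vectors could feed a doccat/sentiment/profiling classifier.
        double similarity = vec.similarity("DOC_0", "DOC_1");
        System.out.println(similarity);
    }
}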

It would be fantastic to clarify this; I believe it would also motivate
more people to pitch in and better assist with this effort.

Thanks,

Anthony
Hi William


I have never heard of Features2Vec.

I think for low-level, pre-linguistic tasks such as text classification,
where we don't want to build models and want a one-size-fits-all solution,
Word2Vec works well. I used it in an industrial environment for text
classification, some information extraction, and content generation tasks.
So I think it should also work for low-level OpenNLP tasks.


Regards

Boris



From: William Colen <william.co...@gmail.com>
Sent: Wednesday, June 29, 2016 4:43:25 AM
To: dev@opennlp.apache.org
Subject: Re: DeepLearning4J as a ML for OpenNLP

Thank you, Boris. I am new to deep learning, so I have no idea what issues
we would face. I was wondering if we could use Features2Vec instead of
Word2Vec; does it make any sense?
The idea was to use DL in low-level NLP tasks where we don't have parse
trees yet.


2016-06-29 6:34 GMT-03:00 Boris Galitsky <bgalit...@hotmail.com>:

> Hi guys
>
>   I should mention how we used DeepLearning4J for the OpenNLP.Similarity
> project at
>
> https://github.com/bgalitsky/relevance-based-on-parse-trees
>
>
> The main question is how word2vec models and linguistic information such
> as parse trees complement each other. In a word2vec approach any two words
> can be compared. The weakness here is that learning based on computing a
> distance between totally unrelated words like 'cat' and 'fly' can be
> meaningless and uninformative, and can corrupt a learning model.
>
>
> In the OpenNLP.Similarity component, similarity is defined in terms of
> parse trees. When word2vec is applied on top of parse trees and not as a
> bag-of-words, we only compute the distance between the words with the same
> semantic role, so the model becomes more accurate.
>
>
> There's a paper on the way which does the assessment of relevance
> improvement for
>
>
> word2vec (bag-of-words) [traditional] vs word2vec (parse-trees)
>
>
> Regards
>
> Boris
>
>
>
>
>
> 
> From: Anthony Beylerian <anthony.beyler...@gmail.com>
> Sent: Wednesday, June 29, 2016 2:13:38 AM
> To: dev@opennlp.apache.org
> Subject: Re: DeepLearning4J as a ML for OpenNLP
>
> +1 would be willing to help out when possible
>


Re: DeepLearning4J as a ML for OpenNLP

2016-06-29 Thread Anthony Beylerian
There's also Doc2vec:

http://deeplearning4j.org/doc2vec.html

Which could work as well.

On Wed, Jun 29, 2016 at 8:43 PM, William Colen <william.co...@gmail.com>
wrote:

> Thank you, Boris. I am new to deep learning, so I have no idea what issues
> we would face. I was wondering if we could use Features2Vec instead of
> Word2Vec; does it make any sense?
> The idea was to use DL in low-level NLP tasks where we don't have parse
> trees yet.
>
>
> 2016-06-29 6:34 GMT-03:00 Boris Galitsky <bgalit...@hotmail.com>:
>
> > Hi guys
> >
> >   I should mention how we used DeepLearning4J for the OpenNLP.Similarity
> > project at
> >
> > https://github.com/bgalitsky/relevance-based-on-parse-trees
> >
> >
> > The main question is how word2vec models and linguistic information such
> > as parse trees complement each other. In a word2vec approach any two words
> > can be compared. The weakness here is that learning based on computing a
> > distance between totally unrelated words like 'cat' and 'fly' can be
> > meaningless and uninformative, and can corrupt a learning model.
> >
> >
> > In the OpenNLP.Similarity component, similarity is defined in terms of
> > parse trees. When word2vec is applied on top of parse trees and not as a
> > bag-of-words, we only compute the distance between the words with the
> same
> > semantic role, so the model becomes more accurate.
> >
> >
> > There's a paper on the way which does the assessment of relevance
> > improvement for
> >
> >
> > word2vec (bag-of-words) [traditional] vs word2vec (parse-trees)
> >
> >
> > Regards
> >
> > Boris
> >
> >
> >
> >
> >
> > 
> > From: Anthony Beylerian <anthony.beyler...@gmail.com>
> > Sent: Wednesday, June 29, 2016 2:13:38 AM
> > To: dev@opennlp.apache.org
> > Subject: Re: DeepLearning4J as a ML for OpenNLP
> >
> > +1 would be willing to help out when possible
> >
>


Re: Performances of OpenNLP tools

2016-06-29 Thread Anthony Beylerian
How about we keep track of the sets used for performance evaluation and
results in this doc for now:

https://docs.google.com/spreadsheets/d/15c0-u61HNWfQxiDSGjk49M1uBknIfb-LkbP4BDWTB5w/edit?usp=sharing

Will try to take a better look at OntoNotes and what to use from it.
Otherwise, if anyone would like to suggest proper data sets for testing
each component, that would be really helpful.
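As a starting point, the components already ship with evaluators we can
script against; for example, a minimal name finder evaluation sketch (the
model and test file names are placeholders) would be:

import java.io.File;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderEvaluator;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class NerEvalSketch {
    public static void main(String[] args) throws Exception {
        // Test data in the OpenNLP name sample format, one sentence per line.
        ObjectStream<NameSample> samples = new NameSampleDataStream(
                new PlainTextByLineStream(
                        new MarkableFileInputStreamFactory(new File("test.txt")),
                        "UTF-8"));

        TokenNameFinderModel model =
                new TokenNameFinderModel(new File("en-ner-person.bin"));
        TokenNameFinderEvaluator evaluator =
                new TokenNameFinderEvaluator(new NameFinderME(model));

        evaluator.evaluate(samples);
        // Precision/recall/F1 to be copied into the spreadsheet.
        System.out.println(evaluator.getFMeasure());
    }
}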

Anthony

On Thu, Jun 23, 2016 at 12:18 AM, Joern Kottmann <kottm...@gmail.com> wrote:

> It would be nice to get MASC support into the OpenNLP formats package.
>
> Jörn
>
> On Tue, Jun 21, 2016 at 6:18 PM, Jason Baldridge <jasonbaldri...@gmail.com
> >
> wrote:
>
> > Jörn is absolutely right about that. Another good source of training data
> > is MASC. I've got some instructions for training models with MASC here:
> >
> > https://github.com/scalanlp/chalk/wiki/Chalk-command-line-tutorial
> >
> > Chalk (now defunct) provided a Scala wrapper around OpenNLP
> functionality,
> > so the instructions there should make it fairly straightforward to adapt
> > MASC data to OpenNLP.
> >
> > -Jason
> >
> > On Tue, 21 Jun 2016 at 10:46 Joern Kottmann <kottm...@gmail.com> wrote:
> >
> > > There are some research papers which study and compare the performance
> of
> > > NLP toolkits, but be careful: often they don't train the NLP tools on
> the
> > > same data, and the training data makes a big difference to the
> > performance.
> > >
> > > Jörn
> > >
> > > On Tue, Jun 21, 2016 at 5:44 PM, Joern Kottmann <kottm...@gmail.com>
> > > wrote:
> > >
> > > > Just don't use the very old existing models; to get good results you
> > have
> > > > to train on your own data, especially if the domain of the data used
> > for
> > > > training and the data which should be processed doesn't match. The
> old
> > > > models are trained on 90s news; those don't work well on today's news
> > and
> > > > probably much worse on tweets.
> > > >
> > > > OntoNotes is a good place to start if the goal is to process news.
> > OpenNLP
> > > > comes with built-in support to train models from OntoNotes.
> > > >
> > > > Jörn
> > > >
> > > > On Tue, Jun 21, 2016 at 4:20 PM, Mattmann, Chris A (3980) <
> > > > chris.a.mattm...@jpl.nasa.gov> wrote:
> > > >
> > > >> This sounds like a fantastic idea.
> > > >>
> > > >> ++
> > > >> Chris Mattmann, Ph.D.
> > > >> Chief Architect
> > > >> Instrument Software and Science Data Systems Section (398)
> > > >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > >> Office: 168-519, Mailstop: 168-527
> > > >> Email: chris.a.mattm...@nasa.gov
> > > >> WWW:  http://sunset.usc.edu/~mattmann/
> > > >> ++
> > > >> Director, Information Retrieval and Data Science Group (IRDS)
> > > >> Adjunct Associate Professor, Computer Science Department
> > > >> University of Southern California, Los Angeles, CA 90089 USA
> > > >> WWW: http://irds.usc.edu/
> > > >> ++
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On 6/21/16, 12:13 AM, "Anthony Beylerian" <
> > anthonybeyler...@hotmail.com
> > > >
> > > >> wrote:
> > > >>
> > > >> >+1
> > > >> >
> > > >> >Maybe we could put the results of the evaluator tests for each
> > > component
> > > >> somewhere on a webpage and on every release update them.
> > > >> >This is of course provided there are reasonable data sets for
> testing
> > > >> each component.
> > > >> >What do you think?
> > > >> >
> > > >> >Anthony
> > > >> >
> > > >> >> From: mondher.bouaz...@gmail.com
> > > >> >> Date: Tue, 21 Jun 2016 15:59:47 +0900
> > > >> >> Subject: Re: Performances of OpenNLP tools
> > > >> >> To: dev@opennlp.apache.org
> > >

Re: DeepLearning4J as a ML for OpenNLP

2016-06-29 Thread Anthony Beylerian
+1 would be willing to help out when possible


RE: Performances of OpenNLP tools

2016-06-21 Thread Anthony Beylerian
+1 

Maybe we could put the results of the evaluator tests for each component
somewhere on a webpage and update them on every release.
This is of course provided there are reasonable data sets for testing each
component.
What do you think?

Anthony

> From: mondher.bouaz...@gmail.com
> Date: Tue, 21 Jun 2016 15:59:47 +0900
> Subject: Re: Performances of OpenNLP tools
> To: dev@opennlp.apache.org
> 
> Hi,
> 
> Thank you for your replies.
> 
> Please accept my apologies once more, Jeffrey, for the duplicate email.
> 
> I also think it would be great to have such studies on the performances of
> OpenNLP.
> 
> I have been looking for this information and checked in many places,
> including obviously Google Scholar, and I haven't found any serious studies
> or reliable results. Most of the existing ones report the performances of
> outdated releases of OpenNLP, and focus more on the execution time or
> CPU/RAM consumption, etc.
> 
> I think such a comparison will help not only evaluate the overall accuracy,
> but also highlight the issues with the existing models (as a matter of
> fact, the existing models fail to recognize many of the hashtags in tweets:
> the tokenizer splits them into the "#" symbol and a word that the PoS
> tagger also fails to recognize).
> 
> Therefore, building Twitter-based models would also be useful, since many
> of the works in academia / industry are focusing on Twitter data.
> 
> Best regards,
> 
> Mondher
> 
> 
> 
> On Tue, Jun 21, 2016 at 12:45 AM, Jason Baldridge 
> wrote:
> 
> > It would be fantastic to have these numbers. This is an example of
> > something that would be a great contribution by someone trying to
> > contribute to open source and who is maybe just getting into machine
> > learning and natural language processing.
> >
> > For Twitter-ish text, it'd be great to look at models trained and evaluated
> > on the Tweet NLP resources:
> >
> > http://www.cs.cmu.edu/~ark/TweetNLP/
> >
> > And comparing to how their models performed, etc. Also, it's worth looking
> > at spaCy (Python NLP modules) for further comparisons.
> >
> > https://spacy.io/
> >
> > -Jason
> >
> > On Mon, 20 Jun 2016 at 10:41 Jeffrey Zemerick 
> > wrote:
> >
> > > I saw the same question on the users list on June 17. At least I thought
> > it
> > > was the same question -- sorry if it wasn't.
> > >
> > > On Mon, Jun 20, 2016 at 11:37 AM, Mattmann, Chris A (3980) <
> > > chris.a.mattm...@jpl.nasa.gov> wrote:
> > >
> > > > Well, hold on. He sent that mail (as of the time of this mail) 4
> > > > mins previously. Maybe some folks need some time to reply ^_^
> > > >
> > > > ++
> > > > Chris Mattmann, Ph.D.
> > > > Chief Architect
> > > > Instrument Software and Science Data Systems Section (398)
> > > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > > Office: 168-519, Mailstop: 168-527
> > > > Email: chris.a.mattm...@nasa.gov
> > > > WWW:  http://sunset.usc.edu/~mattmann/
> > > > ++
> > > > Director, Information Retrieval and Data Science Group (IRDS)
> > > > Adjunct Associate Professor, Computer Science Department
> > > > University of Southern California, Los Angeles, CA 90089 USA
> > > > WWW: http://irds.usc.edu/
> > > > ++
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On 6/20/16, 8:23 AM, "Jeffrey Zemerick"  wrote:
> > > >
> > > > >Hi Mondher,
> > > > >
> > > > >Since you didn't get any replies I'm guessing no one is aware of any
> > > > >resources related to what you need. Google Scholar is a good place to
> > > look
> > > > >for papers referencing OpenNLP and its methods (in case you haven't
> > > > >searched it already).
> > > > >
> > > > >Jeff
> > > > >
> > > > >On Mon, Jun 20, 2016 at 11:19 AM, Mondher Bouazizi <
> > > > >mondher.bouaz...@gmail.com> wrote:
> > > > >
> > > > >> Hi,
> > > > >>
> > > > >> Apologies if you received multiple copies of this email. I sent it
> > to
> > > > the
> > > > >> users list a while ago, and haven't had an answer yet.
> > > > >>
> > > > >> I have been looking for a while if there is any relevant work that
> > > > >> performed tests on the OpenNLP tools (in particular the Lemmatizer,
> > > > >> Tokenizer and PoS-Tagger) when used with short and noisy texts such
> > as
> > > > >> Twitter data, etc., and/or compared it to other libraries.
> > > > >>
> > > > >> By performances, I mean accuracy/precision, rather than time of
> > > > execution,
> > > > >> etc.
> > > > >>
> > > > >> If anyone can refer me to a paper or a work done in this context,
> > that
> > > > >> would be of great help.
> > > > >>
> > > > >> Thank you very much.
> > > > >>
> > > > >> Mondher
> > > > >>
> > > >
> > >
> >
  

Re: [jira] [Updated] (TIKA-2000) Author profile parser

2016-06-16 Thread Anthony Beylerian
Hi Indhu,

Thank you very much for the details.

Just to confirm: the age is estimated within a 10-year window, so the 65%
owes to having the estimate window overlap the actual age. Is this correct?

In this case, this makes sense for age, but other cases can't be treated
similarly, since the values aren't continuous but categorical.

I recommend we use an approach similar to the GeoTopicParser where an
OpenNLP classifier is used.
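To make sure I understood the setup, here is a toy sketch of the regression
idea with made-up numbers (ordinary least squares over bigram counts, then
the +/-5 year window), using commons-math rather than your actual code:

import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression;

public class AgeRegressionSketch {
    public static void main(String[] args) {
        // Made-up training data: rows are per-author bigram counts
        // (the real model uses the top-n bigrams), y holds known ages.
        double[] y = {21, 34, 45, 19, 28, 52};
        double[][] x = {
                {3, 0, 1}, {1, 2, 0}, {0, 3, 2},
                {4, 0, 0}, {2, 1, 1}, {0, 4, 3}
        };

        OLSMultipleLinearRegression ols = new OLSMultipleLinearRegression();
        ols.newSampleData(y, x);
        double[] beta = ols.estimateRegressionParameters(); // [intercept, w1..wn]

        // Predict for a new author, then widen by +/-5 years as described.
        double[] features = {2, 1, 0};
        double age = beta[0];
        for (int i = 0; i < features.length; i++) {
            age += beta[i + 1] * features[i];
        }
        System.out.printf("Estimated-Author-Age-Range: %.0f-%.0f%n",
                age - 5, age + 5);
    }
}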

Best,

Anthony

On Wed, Jun 15, 2016 at 12:47 PM, Indhu Kamala Kumar <kamal...@usc.edu>
wrote:

>
> Hi,
>
> We use Linear Regression to predict the Author Age. The model is created
> by selecting the top 'n' bi-grams as the features. The linear regression
> model is applied to the generated features, where x is the feature matrix
> and y is the corresponding age array. The coefficients and intercept are
> calculated and the age is predicted. The age is then grouped by adding
> and subtracting 5 years from the predicted age. The model is about 65%
> accurate for large documents. Here is the link to the research paper we
> referred to:
> http://repository.cmu.edu/cgi/viewcontent.cgi?article=1215=lti
>
> Regards,
> Indhu
>
> On Tue, Jun 14, 2016 at 9:56 AM, Madhav Sharan <msha...@usc.edu> wrote:
>
>> Yeah agreed I saw your project and I liked the way you created binary and
>> quad age groups. *Indhu* can share more details on linear regression
>> approach and accuracy. As far as I know it's a bigram model based on top
>> 10k features
>>
>> This is how Tika CLI response looks like -
>>
>> Content-Length: 6954
>> Content-Type: application/xml
>> *Estimated-Author-Age: 23*
>> *Estimated-Author-Age-Range: 18-28*
>> X-Parsed-By: org.apache.tika.parser.CompositeParser
>> X-Parsed-By: org.apache.tika.parser.nlp.classifier.TextFeatureParser
>> resourceName: pom.xml
>>
>> I was thinking to add more meta data fields from different approaches in
>> same response. For example we can add a new field 
>> "*Estimated-Author-Age-Binary-Group"
>> *to this. We can run multiple REST API call in parallel and
>> enable/disable through property file. Basically let user define what all
>> API it wants to run and we can club all the results together through TIKA.
>>
>> Thanks
>>
>> --
>> Madhav Sharan
>>
>>
>> On Tue, Jun 14, 2016 at 12:51 AM, Anthony Beylerian <
>> anthony.beyler...@gmail.com> wrote:
>>
>>> Hi Madhav,
>>>
>>> Thank you for sharing, yes maybe it's possible.
>>>
>>> Although there is overlap, the two approaches are a bit different.
>>>
>>> Do you have some documentation on the performance of the linear
>>> regression approach?
>>>
>>> I'm not sure how well it would perform for gender (binary) and other
>>> attributes.
>>>
>>> Ideally it would be desirable to have a way to capture all traits with
>>> reasonable performance.
>>>
>>> Best,
>>>
>>> Anthony
>>>
>>>
>>> On Tue, Jun 14, 2016 at 8:46 AM, Madhav Sharan <msha...@usc.edu> wrote:
>>>
>>>> Hi Anthony, age prediction part of this enhancement looks very similar
>>>> to
>>>> https://issues.apache.org/jira/browse/TIKA-1988
>>>>
>>>> Do you see any way we can collaborate on this feature? I was thinking to
>>>> build a TextFeatureParser which can parse multiple text based features
>>>> like
>>>> age.
>>>>
>>>> In our project for age prediction we built a classifier using linear
>>>> regression which is available through a REST API ( more details in [0]
>>>> ).
>>>> We can configure multiple such REST APIs in TIKA through property file
>>>> and
>>>> then let the TextFeatureParser collate and present all the results.
>>>>
>>>> Let me know what you think about it. [1] has my code for
>>>> TextFeatureParser,
>>>> I will be giving a PR soon.
>>>>
>>>> CCing Indhu for any questions regarding [0]
>>>>
>>>> [0] https://github.com/USCDataScience/Age-Predictor
>>>> [1] https://github.com/smadha/tika/tree/TIKA-1988
>>>>
>>>>
>>>> --
>>>> Madhav Sharan
>>>>
>>>
>>>
>>
>


Re: Profiler for OpenNLP

2016-06-09 Thread Anthony Beylerian
Hello,

Thank you very much for your interest.

We are planning to implement some of the features listed here [1].
However, due to the breadth of approaches, any suggestions or hints based
on your experience are of course welcome.

[1] : http://www.ripublication.com/ijaer16/ijaerv11n5_24.pdf

On Wed, Jun 8, 2016 at 8:14 PM, Kostas Perifanos <kostas.perifa...@gmail.com
> wrote:

> Hi,
>
> very interesting, I have done some work in the field and I would like to
> contribute as well
>
> On Wed, Jun 8, 2016 at 4:29 AM, Madhawa Kasun Gunasekara <
> madhaw...@gmail.com> wrote:
>
> > +1
> > I would like to contribute.
> >
> > Thanks,
> > Madhawa
> >
> > Madhawa
> >
> > On Wed, Jun 8, 2016 at 1:26 AM, Tommaso Teofili <
> tommaso.teof...@gmail.com
> > >
> > wrote:
> >
> > > +1 that sounds quite interesting.
> > >
> > > Regards,
> > > Tommaso
> > >
> > > Il giorno mar 7 giu 2016 alle ore 20:03 Mattmann, Chris A (3980) <
> > > chris.a.mattm...@jpl.nasa.gov> ha scritto:
> > >
> > > > We would love to have this part of Apache Tika. You can take a look
> > > > at the existing NER/NLP stuff integrated like in GeoTopicParser as
> > > > an example and yes please file a JIRA issue:
> > > >
> > > > http://issues.apache.org/jira/browse/TIKA
> > > >
> > > > I would be happy to work with you to make it happen.
> > > >
> > > > See: http://github.com/apache/tika/#contributing-via-github
> > > >
> > > > For guidance.
> > > >
> > > > ++
> > > > Chris Mattmann, Ph.D.
> > > > Chief Architect
> > > > Instrument Software and Science Data Systems Section (398)
> > > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > > Office: 168-519, Mailstop: 168-527
> > > > Email: chris.a.mattm...@nasa.gov
> > > > WWW:  http://sunset.usc.edu/~mattmann/
> > > > ++
> > > > Director, Information Retrieval and Data Science Group (IRDS)
> > > > Adjunct Associate Professor, Computer Science Department
> > > > University of Southern California, Los Angeles, CA 90089 USA
> > > > WWW: http://irds.usc.edu/
> > > > ++
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On 6/7/16, 9:36 AM, "Anthony Beylerian" <anthony.beyler...@gmail.com
> >
> > > > wrote:
> > > >
> > > > >Hello,
> > > > >
> > > > >We are currently working on an experimental author profiler that we
> > > think
> > > > >could be added to the toolkit.
> > > > >
> > > > >The profiler aims to detect the gender and age range of an author.
> > > > >Later we hope to add personality aspects such as:
> > > > >[extroverted, stable, agreeable, conscientious]
> > > > >
> > > > >We would like the teams' opinion on the matter.
> > > > >An initial code drop can be found here[1] if someone is willing to
> > > > >contribute/collaborate on it with us please let us know.
> > > > >
> > > > >Thanks!
> > > > >
> > > > >[1] https://github.com/beylerian/profiler
> > > >
> > >
> >
>


Profiler for OpenNLP

2016-06-07 Thread Anthony Beylerian
Hello,

We are currently working on an experimental author profiler that we think
could be added to the toolkit.

The profiler aims to detect the gender and age range of an author.
Later we hope to add personality aspects such as:
[extroverted, stable, agreeable, conscientious]

We would like the teams' opinion on the matter.
An initial code drop can be found here[1] if someone is willing to
contribute/collaborate on it with us please let us know.

Thanks!

[1] https://github.com/beylerian/profiler


Re: Updates on SentimentAnalysisParser

2016-06-04 Thread Anthony Beylerian
Hi Anastasija,

Good work, sounds great; I will try to review it when it's available.
Just curious, which approach has been implemented so far?

Best,

Anthony



On Sat, Jun 4, 2016 at 9:36 PM, Anastasija Mensikova <
mensikova.anastas...@gmail.com> wrote:

> Hello everyone,
>
> I hope you are enjoying the first few days of Summer.
>
> Just some updates on what we've been working on with Chris in terms of the
> SentimentAnalysisParser.
>
> I have recently created a command line tool, fixed some issues and have
> started implementing the Parser for Tika (I am yet to push the working
> version to GitHub, just going through debugging right now). Chris has also
> now given me access to MEMEX (the weapons ads) so I can run my Parser on
> them and perform sentiment analysis classification. I will also now start
> working on visualising our work by using D3.js.
>
> Thank you,
> Anastasija
>


Re: GSoC 2016: OpenNLP Sentiment Analysis

2016-05-08 Thread Anthony Beylerian
>> >Instrument Software and Science Data Systems Section (398)
>> >NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> >Office: 168-519, Mailstop: 168-527
>> >Email: chris.a.mattm...@nasa.gov
>> >WWW:
>> >http://sunset.usc.edu/~mattmann/ <http://sunset.usc.edu/~mattmann/>
>> >++
>> >Director, Information Retrieval and Data Science Group (IRDS)
>> >Adjunct Associate Professor, Computer Science Department
>> >University of Southern California, Los Angeles, CA 90089 USA
>> >WWW: http://irds.usc.edu/
>> >++
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >On 4/25/16, 11:07 PM, "Anastasija Mensikova" <
>> mensikova.anastas...@gmail.com> wrote:
>> >
>> >>Hi everyone,
>> >>
>> >>
>> >>So is the hangout session tomorrow (Tuesday) at 6:30pm IST (9am EST)
>> confirmed or not?
>> >>
>> >>
>> >>Thank you,
>> >>Anastasija
>> >>
>> >>
>> >>On 25 April 2016 at 15:23, Madhawa Kasun Gunasekara
>> >><madhaw...@gmail.com> wrote:
>> >>
>> >>Hi all,
>> >>
>> >>
>> >>Shall we have the hangout session tomorrow (Tuesday) about 18:30 IST ?
>> >>
>> >>
>> >>Thanks,
>> >>
>> >>Madhawa
>> >>
>> >>
>> >>
>> >>
>> >>Madhawa
>> >>
>> >>
>> >>
>> >>
>> >>On Sun, Apr 24, 2016 at 10:33 PM, Mondher Bouazizi
>> >><mondher.bouaz...@gmail.com> wrote:
>> >>
>> >>Hi,
>> >>
>> >>I am sorry for my late reply.
>> >>
>> >>Given the time difference between Japan and USA, I think I won't be
>> >>available on weekdays. I will be available only on Friday/Saturday
>> morning
>> >>(9-10am EST).
>> >>
>> >>I am not sure if Chris is OK with that, we had our previous meetings on
>> >>Saturday mornings.
>> >>
>> >>Otherwise, please go ahead. I will join as soon as I can.
>> >>
>> >>Thanks.
>> >>
>> >>@Chris: my github ID is mondher-bouazizi
>> >>
>> >>Best regards,
>> >>
>> >>Mondher
>> >>
>> >>On Mon, Apr 25, 2016 at 1:44 AM, Anastasija Mensikova <
>> >>mensikova.anastas...@gmail.com> wrote:
>> >>
>> >>> Hi Anthony,
>> >>>
>> >>> I can make it by Madhawa's proposal too, after 6pm IST on Tuesday
>> (after
>> >>> 8:30am EST). Let me know when exactly!
>> >>>
>> >>> Thank you,
>> >>> Anastasija
>> >>>
>> >>> On 24 April 2016 at 03:02, Anthony Beylerian <
>> anthony.beyler...@gmail.com>
>> >>> wrote:
>> >>>
>> >>>> Hi Anastasija,
>> >>>>
>> >>>> I'm not available at those times (00:00-07:00 JST). I could make
>> >>>> Madhawa's proposed time, but otherwise please go ahead; we may
>> >>>> discuss some other time.
>> >>>>
>> >>>> @Chris: github ID : beylerian
>> >>>>
>> >>>> Best,
>> >>>>
>> >>>> Anthony
>> >>>>
>> >>>>
>> >>>> Please find my github profile
>> >
>> >
>> >>https://github.com/madhawa-gunasekara
>> >>>>
>> >>>> Madhawa
>> >>>>
>> >>>> On Sun, Apr 24, 2016 at 12:13 AM, Madhawa Kasun Gunasekara <
>> >>>> madhaw...@gmail.com> wrote:
>> >>>>
>> >>>> > Hi Chris,
>> >>>> >
>> >>>> > I'm available on Tuesday & Wednesday after 6.00 pm IST.
>> >>>> >
>> >>>> > Thanks,
>> >>>> > Madhawa
>> >>>> >
>> >>>> > Madhawa
>> >>>> >
>> >>>> > On Sat, Apr 23, 2016 at 11:38 PM, Anastasija Mensikova <
>> >>>> > mensikova.anastas...@gmail.com> wrote:
>> >

Re: GSoC 2016: OpenNLP Sentiment Analysis

2016-04-27 Thread Anthony Beylerian
Hi Rodrigo,

Thank you for sharing; SemEval is a good reference.
An aspect-based classifier/quantifier would be nice to have in the kit.

In the Google doc we have referred to aspects as "topics".
We have discussed making the initial functionality work regardless of the
entities in the text.
However, it was also planned to survey aspect-based approaches, and SemEval
is a good place to start.

I will try to update the idea/task document after surveying some of the
SemEval results.
Please feel free to also update it.

Anthony

On Wed, Apr 27, 2016 at 4:23 AM, Rodrigo Agerri <rodrigo.age...@ehu.eus>
wrote:

> Hello,
>
> Everything looks very interesting.  Other options are the Aspect Based
> Sentiment Analysis tasks as described in
>
> http://alt.qcri.org/semeval2014/task4/
> http://alt.qcri.org/semeval2015/task12/
> http://alt.qcri.org/semeval2016/task5/
>
> The task is well circumscribed, plus the data is publicly available,
> which makes it easier to set manageable objectives for a GSoC.
>
> Best,
>
> Rodrigo
>
>
>
> On Tue, Apr 26, 2016 at 6:10 PM, Anthony Beylerian
> <anthony.beyler...@gmail.com> wrote:
> > Please check this approach [1]; it could be useful to combine
> > a labeled seed set with unlabeled Fisher CallHome data.
> > Since it may be a long read, there's a shorter ppt as well [2]
> >
> > [1] link.springer.com/article/10.1023%2FA%3A1007692713085
> > [2] cseweb.ucsd.edu/~atsmith/presentation_final.ppt
> >
> >
> > On Tue, Apr 26, 2016 at 11:36 PM, Joern Kottmann <kottm...@gmail.com>
> wrote:
> >
> >> The Large Movie Review Dataset might be interesting for this as well:
> >> http://ai.stanford.edu/~amaas/data/sentiment/
> >>
> >> Jörn
> >>
> >> On Tue, Apr 26, 2016 at 4:26 PM, Anthony Beylerian <
> >> anthony.beyler...@gmail.com> wrote:
> >>
> >> > sentiment analysis discussion doc :
> >> >
> >> >
> >> >
> >>
> https://docs.google.com/document/d/1Gi59YqtisY4NLaVY3B7CNLMTgCRZm9JEk17kmBmWXqQ/edit?usp=sharing
> >> >
> >> > On Tue, Apr 26, 2016 at 10:56 PM, Mattmann, Chris A (3980) <
> >> > chris.a.mattm...@jpl.nasa.gov> wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > Sure here is the link:
> >> > >
> >> > > https://hangouts.google.com/call/a2w5cgdtirf6jgfb4ww5l2l64ee
> >> > >
> >> > > Sorry for the delay.
> >> > >
> >> > > Cheers,
> >> > > Chris
> >> > >
> >> > > ++
> >> > > Chris Mattmann, Ph.D.
> >> > > Chief Architect
> >> > > Instrument Software and Science Data Systems Section (398)
> >> > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> > > Office: 168-519, Mailstop: 168-527
> >> > > Email: chris.a.mattm...@nasa.gov
> >> > > WWW:  http://sunset.usc.edu/~mattmann/
> >> > > ++
> >> > > Director, Information Retrieval and Data Science Group (IRDS)
> >> > > Adjunct Associate Professor, Computer Science Department
> >> > > University of Southern California, Los Angeles, CA 90089 USA
> >> > > WWW: http://irds.usc.edu/
> >> > > ++
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > On 4/26/16, 6:48 AM, "Anastasija Mensikova" <
> >> > > mensikova.anastas...@gmail.com> wrote:
> >> > >
> >> > > >Hi everyone,
> >> > > >
> >> > > >
> >> > > >Is the 9:40 ET hangout still happening? I just have to leave soon
> to
> >> go
> >> > > to class.
> >> > > >
> >> > > >
> >> > > >Thank you,
> >> > > >Anastasija
> >> > > >
> >> > > >
> >> > > >On 25 April 2016 at 23:39, Anastasija Mensikova
> >> > > ><mensikova.anastas...@gmail.com> wrote:
> >> > > >
> >> > > >Hi Chris,
> >> > > >
> >> > > >
> >> > > >Yes, that's perfect. I'll be ready by 9:40am.
> >

Re: GSoC 2016: OpenNLP Sentiment Analysis

2016-04-24 Thread Anthony Beylerian
Hi Anastasija,

I'm not available at those times (00:00-07:00 JST). I could make Madhawa's
proposed time, but otherwise please go ahead; we may discuss some other time.

@Chris: github ID : beylerian

Best,

Anthony


Please find my github profile https://github.com/madhawa-gunasekara

Madhawa

On Sun, Apr 24, 2016 at 12:13 AM, Madhawa Kasun Gunasekara <
madhaw...@gmail.com> wrote:

> Hi Chris,
>
> I'm available on Tuesday & Wednesday after 6.00 pm IST.
>
> Thanks,
> Madhawa
>
> Madhawa
>
> On Sat, Apr 23, 2016 at 11:38 PM, Anastasija Mensikova <
> mensikova.anastas...@gmail.com> wrote:
>
>> Hi Chris,
>>
>> Thank you very much for your email. I'm so excited to work with you!
>>
>> My Github name is amensiko.
>>
>> And yes, next week sounds good! I'm available on: Tuesday at 4:20pm EST,
>> Thursday 11am - 2:30pm and 4:20 - 6pm EST, Friday 11am - 3pm EST.
>>
>> Thank you,
>> Anastasija
>>
>> On 23 April 2016 at 10:21, Mattmann, Chris A (3980) <
>> chris.a.mattm...@jpl.nasa.gov> wrote:
>>
>>> Hi Anastasija,
>>>
>>> Hope you are well. It’s now time to get started on the project.
>>> Mondher, Anthony, Madhawa and I have been discussing ideas about
>>> how to proceed with the project and even developing a task list.
>>> Let’s get your tasks input into that list, and also coordinate.
>>>
>>> I also have an action to share some Spanish/English data to try
>>> and do cross lingual sentiment analysis.
>>>
>>> Are you available to chat this week?
>>>
>>> Cheers,
>>> Chris
>>>
>>> ++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398)
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: chris.a.mattm...@nasa.gov
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++
>>> Director, Information Retrieval and Data Science Group (IRDS)
>>> Adjunct Associate Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> WWW: http://irds.usc.edu/
>>> ++
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 4/23/16, 4:49 AM, "Anthony Beylerian" <anthony.beyler...@gmail.com>
>>> wrote:
>>>
>>> >Hello,
>>> >
>>> >Congratulations on being accepted for this year's GSoC.
>>> >Although Mondher and I will not participate this year as students,
>>> we
>>> >will do our best to help.
>>> >We are currently busy with academic research, but will join the efforts
>>> >when possible.
>>> >Otherwise, for any discussion concerning the proposed approaches,
please
>>> >let us know.
>>> >
>>> >Best,
>>> >
>>> >On Sat, Apr 23, 2016 at 6:02 PM, Madhawa Kasun Gunasekara <
>>> >madhaw...@gmail.com> wrote:
>>> >
>>> >> Sure we will start working on this.
>>> >>
>>> >> Thanks,
>>> >> Madhawa
>>> >>
>>> >> Madhawa
>>> >>
>>> >> On Sat, Apr 23, 2016 at 1:38 AM, Chris Mattmann <mattm...@apache.org>
>>> >> wrote:
>>> >>
>>> >>> Congrats!
>>> >>>
>>> >>> time to get started team.
>>> >>>
>>>
>>
>>
>


RE: GSOC2016 Sentiment Analysis

2016-03-29 Thread Anthony Beylerian
Dear Chris,

Thank you again for reviewing our proposals. 
We are looking forward to working together on this.

In our previous trials we used an annotated corpus made through CrowdFlower
for testing, and would be happy to share it.
Although relatively modest and noisy (~10k training, ~8k testing, ~20k
pattern extraction), we believe it was enough to demonstrate encouraging
performance.
From our side, we also have a Java implementation that we would like to
shape up for production; however, I'm also comfortable with Python in case
we need it.
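For reference, the baseline we would be building beyond is OpenNLP's stock
document categorizer; a minimal sketch of that route (the training file
name and the label-then-text line format are placeholders) looks like:

import java.io.File;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class SentimentDoccatSketch {
    public static void main(String[] args) throws Exception {
        // train.txt: one "label text..." sample per line, e.g.
        //   positive I really love this phone
        //   negative the battery died after a day
        ObjectStream<DocumentSample> samples = new DocumentSampleStream(
                new PlainTextByLineStream(
                        new MarkableFileInputStreamFactory(new File("train.txt")),
                        "UTF-8"));

        DoccatModel model = DocumentCategorizerME.train("en", samples,
                TrainingParameters.defaultParams(), new DoccatFactory());

        DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
        double[] outcomes = categorizer.categorize(
                new String[] {"what", "a", "wonderful", "day"});
        System.out.println(categorizer.getBestCategory(outcomes));
    }
}

The dedicated component would go beyond this with sentiment-specific
features, pattern extraction, and quantification, as in the proposals.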

On the other hand, it sounds intriguing to use a cross-lingual corpus; we
would love to discuss it.
As for the hangout session, I have just checked with Mondher and the time works 
for us.

Best,

Anthony


> From: chris.a.mattm...@jpl.nasa.gov
> To: mondher.bouaz...@gmail.com; madhaw...@gmail.com
> CC: anthonybeyler...@hotmail.com; dev@opennlp.apache.org; 
> d...@tika.apache.org; ird...@mymaillists.usc.edu
> Subject: Re: GSOC2016 Sentiment Analysis
> Date: Tue, 29 Mar 2016 13:57:11 +
> 
> I like both of your comments Mondher and Madhawa. My team at USC
> has been investigating the use of particular corpuses including
> Fisher Callhome so as to support sentiment analysis. We have been
> writing Java code outside of both OpenNLP and Tika but with the
> goal of integrating them into both. We have a mix of Java and
> Python code that we’d like to bring into both projects.
> 
> I’m reviewing the proposals you wrote now, but would it make sense
> to have a Google hangout this Friday, ~10am PT Los Angeles/time?
> 
> Cheers,
> Chris
> 
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
> 
> 
> 
> 
> 
> -Original Message-
> From: Mondher Bouazizi <mondher.bouaz...@gmail.com>
> Date: Monday, March 28, 2016 at 11:46 PM
> To: Madhawa Kasun Gunasekara <madhaw...@gmail.com>, jpluser
> <chris.a.mattm...@jpl.nasa.gov>
> Cc: Anthony Beylerian <anthonybeyler...@hotmail.com>,
> "dev@opennlp.apache.org" <dev@opennlp.apache.org>, "d...@tika.apache.org"
> <d...@tika.apache.org>, Information and Data Science Group USC List
> <ird...@mymaillists.usc.edu>
> Subject: Re: GSOC2016 Sentiment Analysis
> 
> >Dear Madhawa,
> >
> >
> >Thank you for your interest in the proposals.
> >The current tasks we proposed refer to the classification and
> >quantification regardless of the topic.
> >This can be used in a larger context where the topic is not specified, or
> >not unique, in which case we will need to identify the topic(s).
> >Therefore, a topic detector would be a good idea to implement, in order
> >to complement this.
> >
> >
> >As for the Document Categorizer, it is a general purpose component with
> >basic features (n-gram, bag of words, etc.).
> >
> >It is basically used for the classification of texts into a set of
> >classes defined by the user, whether they are sentiment classes or other.
> >
> >However it doesn't perform well for this purpose.
> >
> >Furthermore, the sentiment analysis component would not just perform the
> >naive classification but also additional tasks (e.g., quantification) and
> >implement more specific and sophisticated approaches.
> >
> >
> >Please share your thoughts.
> >
> >
> >Mondher
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >On Tue, Mar 29, 2016 at 1:51 PM, Madhawa Kasun Gunasekara
> ><madhaw...@gmail.com> wrote:
> >
> >Hi Chris / Anthony
> >
> >
> >Yes, I would like to work on this. This proposal addresses most of the
> >things in sentiment analysis.
> >
> >AFAIK most people use the OpenNLP Document Categorizer for sentiment
> >analysis, since there isn't proper functionality to do sentiment
> >analysis in OpenNLP. It would be great if we could add this feature to
> >the OpenNLP project, and I would also like to suggest
> > that we should be able to detect the target object of the opinions fr

RE: GSOC2016 Sentiment Analysis

2016-03-28 Thread Anthony Beylerian
Dear Chris,

Thank you for starting the discussion.
We are glad there is an interest in a sentiment analysis component.

My colleague Mondher posted the two JIRA issues related to Sentiment Analysis 
[1][2] as references for our proposals [3][4] for GSoC.
In fact, we have been researching this topic at our university.
We are hoping to participate this year and work on integrating both a sentiment 
classifier and a quantifier for the library.

It would be nice to also have an interface with Tika; maybe we can collaborate?
We are also looking for mentors, in case someone is willing to support our
proposals.

Best,

Anthony

[1] https://issues.apache.org/jira/browse/OPENNLP-842
[2] https://issues.apache.org/jira/browse/OPENNLP-840
[3] https://docs.google.com/document/d/1nVnwpmGaOnwHERXr55IClE4V87jUX2sva-mkgWnR8n0/edit?usp=sharing
[4] https://docs.google.com/document/d/1x02II9W3rirtuSbx_sY8kOQZSgOp0SIKeIWTCXEOJvo/edit?usp=sharing
> From: chris.a.mattm...@jpl.nasa.gov
> To: nishant@gmail.com
> CC: dev@opennlp.apache.org; madhaw...@gmail.com; hmanj...@usc.edu; 
> kamal...@usc.edu
> Subject: Re: GSOC2016 Sentiment Analysis
> Date: Sun, 27 Mar 2016 19:34:24 +
> 
> No problem - I just wanted to encourage discussion thank you for
> your prompt and courteous replies.
> 
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
  

RE: Word Sense Disambiguator

2015-11-02 Thread Anthony Beylerian
Hello Cristian,

Sorry for the late reply; I finally have a copy of a good corpus for
coarse-grained testing (OntoNotes).
I will start working again on the component sometime this week.

Best,

Anthony 

> Date: Mon, 12 Oct 2015 15:24:46 +0300
> Subject: Re: Word Sense Disambiguator
> From: cristian.petro...@gmail.com
> To: dev@opennlp.apache.org
> 
> Hi,
> 
> Thanks Anthony for the info.
> Does anybody else know when the WSD component will be merged into trunk and
> possibly cut a release with it?
> 
> Thanks
> 
> On Sat, Sep 19, 2015 at 9:21 AM, Anthony Beylerian <
> anthony.beyler...@gmail.com> wrote:
> 
> > Hey Cristian,
> >
> > Sorry for the late reply, I am currently on summer break but will get back
> > on it in one-two weeks.
> >
> > Can't really say when there will be a new release.
> > This usually involves other components as well and it might take time to
> > vote.
> >
> > However, some things to expect for the WSD component:
> >
> > - Support for the different types of classifiers for the supervised
> > approaches (right now only ME based).
> > - Support for augmenting the general domain training with specific domain
> > information.
> >
> > Best,
> >
> > Anthony
> >
> >
> > On Thu, Sep 17, 2015 at 11:18 PM, Cristian Petroaca <
> > cristian.petro...@gmail.com> wrote:
> >
> > > Hi Anthony,
> > >
> > > Do you know when will the WSD component be available in an OpenNLP
> > release?
> > >
> > > Thanks,
> > > Cristian
> > >
> > > On Thu, Sep 10, 2015 at 10:32 AM, Cristian Petroaca <
> > > cristian.petro...@gmail.com> wrote:
> > >
> > > > Yes, that's what I was looking for.
> > > > Thanks Aliaksandr.
> > > >
> > > > On Wed, Sep 9, 2015 at 9:39 PM, Aliaksandr Autayeu <
> > > aliaksa...@autayeu.com
> > > > > wrote:
> > > >
> > > >> Cristian, the reference you gave basically uses synset offsets - 1740
> > is
> > > >> entity, 1930 is physical entity, etc. However, in YAGO they seem to
> > > have
> > > >> added 1 to those offsets.
> > > >>
> > > >> Synset offset is the fastest way to get into WordNet dictionary,
> > because
> > > >> it
> > > >> is a direct file offset. Offset alone is not enough though, you also
> > > need
> > > >> POS - part of speech. Speed probably is the reason most people access
> > > >> WordNet this way. However, offset is not the best "key", especially
> > for
> > > >> indexing, because offsets change as WordNet evolves. SenseKeys (e.g.
> > > >> bank%1:14:00::
> > > >> and bank%1:21:01::) should be more suitable for indexing.
> > > >>
> > > >> If you're looking to connect with YAGO above, you might do something
> > > along
> > > >> the lines of
> > > >> getWordBySenseKey(sensekey).getSynset().getOffset and then add
> > > >> 1
> > > >> to get the YAGO ids.
> > > >>
> > > >> Aliaksandr
> > > >>
> > > >>
> > > >> On 9 September 2015 at 09:51, Cristian Petroaca <
> > > >> cristian.petro...@gmail.com
> > > >> > wrote:
> > > >>
> > > >> > I am looking for the Sense Id of the word. It has this format here :
> > > >> >
> > > >> >
> > > >>
> > >
> > http://resources.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoWordnetIds.txt
> > > >> >
> > > >> >
> > > >> > On Tue, Sep 8, 2015 at 6:47 PM, Anthony Beylerian <
> > > >> > anthony.beyler...@gmail.com> wrote:
> > > >> >
> > > >> > > Hi,
> > > >> > >
> > > >> > > Thanks, it is still being improved.
> > > >> > >
> > > >> > > I am not sure what you mean by type or database ID.
> > > >> > > Currently the sense source and the sense ID are returned.
> > > >> > >
> > > >> > > For example:
> > > >> > >
> > > >> > > "I went to the bank to deposit money."
> > > >> > > target : bank (index : 4)
> > > >> > > expected output : [WORDNET bank%1:14:00:: 21.6, WORDNET
> > 

Re: Word Sense Disambiguator

2015-09-19 Thread Anthony Beylerian
Hey Cristian,

Sorry for the late reply, I am currently on summer break but will get back
on it in one-two weeks.

Can't really say when there will be a new release.
This usually involves other components as well and it might take time to
vote.

However, some things to expect for the WSD component:

- Support for the different types of classifiers for the supervised
approaches (right now only ME based).
- Support for augmenting the general domain training with specific domain
information.

Best,

Anthony


On Thu, Sep 17, 2015 at 11:18 PM, Cristian Petroaca <
cristian.petro...@gmail.com> wrote:

> Hi Anthony,
>
> Do you know when will the WSD component be available in an OpenNLP release?
>
> Thanks,
> Cristian
>
> On Thu, Sep 10, 2015 at 10:32 AM, Cristian Petroaca <
> cristian.petro...@gmail.com> wrote:
>
> > Yes, that's what I was looking for.
> > Thanks Aliaksandr.
> >
> > On Wed, Sep 9, 2015 at 9:39 PM, Aliaksandr Autayeu <
> aliaksa...@autayeu.com
> > > wrote:
> >
> >> Cristian, the reference you gave basically uses synset offsets - 1740 is
> >> entity, 1930 is physical entity, etc. However, in YAGO they seem to
> have
> >> added 1 to those offsets.
> >>
> >> Synset offset is the fastest way to get into WordNet dictionary, because
> >> it
> >> is a direct file offset. Offset alone is not enough though, you also
> need
> >> POS - part of speech. Speed probably is the reason most people access
> >> WordNet this way. However, offset is not the best "key", especially for
> >> indexing, because offsets change as WordNet evolves. SenseKeys (e.g.
> >> bank%1:14:00::
> >> and bank%1:21:01::) should be more suitable for indexing.
> >>
> >> If you're looking to connect with YAGO above, you might do something
> along
> >> the lines of
> >> getWordBySenseKey(sensekey).getSynset().getOffset and then add
> >> 1
> >> to get the YAGO ids.
> >>
> >> Aliaksandr
> >>
> >>
> >> On 9 September 2015 at 09:51, Cristian Petroaca <
> >> cristian.petro...@gmail.com
> >> > wrote:
> >>
> >> > I am looking for the Sense Id of the word. It has this format here :
> >> >
> >> >
> >>
> http://resources.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoWordnetIds.txt
> >> >
> >> >
> >> > On Tue, Sep 8, 2015 at 6:47 PM, Anthony Beylerian <
> >> > anthony.beyler...@gmail.com> wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > Thanks, it is still being improved.
> >> > >
> >> > > I am not sure what you mean by type or database ID.
> >> > > Currently the sense source and the sense ID are returned.
> >> > >
> >> > > For example:
> >> > >
> >> > > "I went to the bank to deposit money."
> >> > > target : bank (index : 4)
> >> > > expected output : [WORDNET bank%1:14:00:: 21.6, WORDNET
> bank%1:21:01::
> >> > > 5.8,... etc]
> >> > >
> >> > > Where "bank%1:14:00::" is a SenseKey which you can query WordNet
> with
> >> to
> >> > > give you a sense definition.
> >> > >
> >> > > You can do this using the default dictionary :
> >> > >
> >> > >
> >> >
> >>
> Dictionary.getDefaultResourceInstance().getWordBySenseKey(sensekey).getSynset().getGloss();
> >> > >
> >> > > Hope this is what you are looking for, otherwise please clarify.
> >> > >
> >> > > Anthony Beylerian
> >> > >
> >> > > On Tue, Sep 8, 2015 at 5:34 PM, Cristian Petroaca <
> >> > > cristian.petro...@gmail.com> wrote:
> >> > >
> >> > > > Hi Anthony,
> >> > > >
> >> > > > I had a chance to test the wsd component. That's great work.
> Thanks.
> >> > > > One question, is it possible to return the wordnet type (or
> database
> >> > id)
> >> > > of
> >> > > > the disambiguated word?
> >> > > >
> >> > > > Thanks,
> >> > > > Cristian
> >> > > >
> >> > > > On Fri, Jul 24, 2015 at 1:14 PM, Anthony Beylerian <
> >> > > > anthonybeyler...@hotmail.com> wrote:
> >> > > >
> >> > > > > 

GSoC - WSD component

2015-09-01 Thread Anthony Beylerian
Hello,

We have received the results concerning this year's GSoC.
I am glad we have passed the final evaluation!
I would really like to thank Jörn and Rodrigo for their support during the
program.
We have enjoyed the challenges and hope to contribute in the future.

Concerning the next steps, we are currently working on the packaging of
what is already available.
Among other things, we are mostly improving the CLI support as well as the unit tests.

Otherwise, there is an interesting approach to enhance the performance
using domain-knowledge information [1].

In that paper, OntoNotes and SemCor were used to obtain accuracy close to
90% in the coarse-grained setting.

Moreover, with the so-called "Augment" technique, it is possible to combine
specific domain-related information with the general-domain training
information.
This is useful, e.g., for the medical field (related to cTAKES), and is
expected to give better performance when domain knowledge is available.

I believe it would be possible to add support for this approach later on
since it only involves augmenting the feature space.
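As a rough sketch, the augmentation amounts to something like this (the
helper names here are hypothetical, not the sandbox API):

// general IMS features (POS tags, surrounding words, collocations)
String[] generalFeatures = extractImsFeatures(tokens, targetIndex);
// extra features derived from a domain resource, e.g. a medical lexicon
String[] domainFeatures = extractDomainFeatures(tokens, domainLexicon);
// the classifier is then trained on the concatenation of both
String[] augmented = new String[generalFeatures.length + domainFeatures.length];
System.arraycopy(generalFeatures, 0, augmented, 0, generalFeatures.length);
System.arraycopy(domainFeatures, 0, augmented, generalFeatures.length, domainFeatures.length);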

Regards,

Anthony

[1] : http://www.aclweb.org/anthology/D08-1105


RE: Word Sense Disambiguator

2015-07-24 Thread Anthony Beylerian
Hi,

To try out the ongoing implementations, after checking out the sandbox
repository please try these steps:

1- Create a resource models directory:

- src
  - test
    - resources
      + models

2- Include the following pre-trained models and dictionary in that directory
(you can find them here [1], or pre-train your own models):

- en-token.bin
- en-pos-maxent.bin
- en-sent.bin
- en-ner-person.bin
- en-lemmatizer.dict

To train the IMS approach, you need to include training data such as
Senseval-3 [2]. For now, please add these folders:

- src
  - test
    - resources
      - supervised
        + raw
        + models
        + dictionary

You can find the data files here [2].

3- We included two examples [LeskTester.java] and [IMSTester.java] that you can 
run directly, or make your own tests.

To run a custom test, minimally you need to have a tokenized text or sentence,
for example for Lesk:

  String[] words = Loader.getTokenizer().tokenize(sentence);

Choose the index of the word to disambiguate in the token array:

  int wordIndex = 6;

Then just create a WSDisambiguator object, for example for Lesk:

  Lesk lesk = new Lesk();

And you can call the default disambiguation method:

  lesk.disambiguate(words, wordIndex);

You will get an array of strings with the following format:

  Lesk : [Source SenseKey Score]

To read the sense definitions you can use the method
[opennlp.tools.disambiguator.Constants.printResults].

For using the variations of Lesk, you will need to create and configure a
parameters object:

  LeskParameters leskParams = new LeskParameters();
  leskParams.setLeskType(LeskParameters.LESK_TYPE.LESK_BASIC_CTXT_WIN_BF);
  leskParams.setWin_b_size(4);
  leskParams.setDepth(3);
  lesk.setParams(leskParams);
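
Putting the steps together, a minimal custom test could look like this (a
sketch only; it assumes the models from step 2 are in place and that Loader,
Lesk and LeskParameters live in the opennlp.tools.disambiguator package, like
Constants):

import opennlp.tools.disambiguator.*;

public class MyLeskTest {

  public static void main(String[] args) {
    String sentence = "I went to the bank to deposit money.";
    // tokenize with the pre-loaded tokenizer (en-token.bin)
    String[] words = Loader.getTokenizer().tokenize(sentence);
    int wordIndex = 4; // "bank"

    // configure a Lesk variant instead of the default
    Lesk lesk = new Lesk();
    LeskParameters leskParams = new LeskParameters();
    leskParams.setLeskType(LeskParameters.LESK_TYPE.LESK_BASIC_CTXT_WIN_BF);
    leskParams.setWin_b_size(4);
    leskParams.setDepth(3);
    lesk.setParams(leskParams);

    // each result string has the format: Source SenseKey Score
    // (use Constants.printResults to resolve the sense definitions)
    String[] senses = lesk.disambiguate(words, wordIndex);
    for (String sense : senses) {
      System.out.println(sense);
    }
  }
}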

Typically, IMS should perform better than Lesk: Lesk is a classic method that
is usually used as a baseline, along with the most frequent sense (MFS).
However, we will be testing and adding more techniques.

In any case, please feel free to ask for more details.

Best,

Anthony

[1] : 
https://drive.google.com/folderview?id=0B67Iu3pf6WucfjdYNGhDc3hkTXd1a3FORnNUYzd3dV9YeWlyMFczeHU0SE1TcWwyU1lhZFU&usp=sharing
[2] : 
https://drive.google.com/file/d/0ByL0dmKXzHVfSXA3SVZiMnVfOGc/view?usp=sharing
 Date: Fri, 24 Jul 2015 09:54:02 +0200
 Subject: Re: Word Sense Disambiguator
 From: kottm...@gmail.com
 To: dev@opennlp.apache.org
 
 It would be nice if you could share instructions on how to run it.
 I also would like to give it a try.
 
 Jörn
 
 On Fri, Jul 24, 2015 at 4:54 AM, Anthony Beylerian 
 anthonybeyler...@hotmail.com wrote:
 
  Hello,
   Yes, for the moment we are only using WordNet for sense definitions. The
   plan is to complete the package by mid to late August, but if you like you
   can follow up on the progress from the sandbox.
  Best regards,
  Anthony
   Date: Thu, 23 Jul 2015 15:36:57 +0300
   Subject: Word Sense Disambiguator
   From: cristian.petro...@gmail.com
   To: dev@opennlp.apache.org
  
   Hi,
  
   I saw that there are people actively working on a Word Sense
  Disambiguator.
    Do you guys know when the module will be ready to use? Also, I assume that
    WordNet is used to define the disambiguated word meaning?
  
   Thanks,
   Cristian
 
 
  

RE: Word Sense Disambiguator

2015-07-23 Thread Anthony Beylerian
Hello,
Yes, for the moment we are only using WordNet for sense definitions. The plan
is to complete the package by mid to late August, but if you like you can
follow up on the progress from the sandbox.
Best regards,
Anthony
 Date: Thu, 23 Jul 2015 15:36:57 +0300
 Subject: Word Sense Disambiguator
 From: cristian.petro...@gmail.com
 To: dev@opennlp.apache.org
 
 Hi,
 
 I saw that there are people actively working on a Word Sense Disambiguator.
  Do you guys know when the module will be ready to use? Also, I assume that
  WordNet is used to define the disambiguated word meaning?
 
 Thanks,
 Cristian
  

RE: WSD - Supervised techniques

2015-07-13 Thread Anthony Beylerian
Dear Rodrigo, 

Thank you for the feedback.

I have added [1][2][3] issues regarding the below.

Concerning the testers (IMSTester, etc.), they should be in src/test/java/.
We can add docs in those to explain how to use each implementation.

Actually, I am using the parser for Senseval-3 that Mondher mentioned in
[LeskEvaluatorTest]; the functionality was included in DataExtractor.
I believe it would be best to separate that and have two parser/converter
classes of the sort:

disambiguator.reader.SemCorReader,
disambiguator.reader.SensevalReader.

That should be clearer, what do you think?
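
For illustration, the common surface of the two readers could be as simple as
this (tentative; the raw String[] instances are just a placeholder until we
settle on an instance class):

package opennlp.tools.disambiguator.reader;

import java.util.List;

// implemented by both SemCorReader and SensevalReader:
// turn a corpus file into target word + gold sense key + context instances
public interface CorpusReader {
  // each element: {word, senseKey, context tokens...}, kept as raw strings
  // here to avoid committing to an instance class yet
  List<String[]> read(String corpusPath);
}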

Anthony

[1]: https://issues.apache.org/jira/browse/OPENNLP-794
[2]: https://issues.apache.org/jira/browse/OPENNLP-795
[3]: https://issues.apache.org/jira/browse/OPENNLP-796

 From: rage...@apache.org
 Date: Mon, 13 Jul 2015 15:50:00 +0200
 Subject: Re: WSD - Supervised techniques
 To: dev@opennlp.apache.org
 
 Hello,
 
  There has been little public activity these last days. We believe that it is
  very important to step up in several directions wrt what is already
  committed in svn:
 
 1. Finishing the WSDEvaluator
 2. Provide the classes required to run the WSD tools from the CLI as
 any other component.
  3. Formats: it will be interesting to have at least a converter for the
  most common datasets used for evaluation and training, e.g., SemCor and
  Senseval-3. You have mentioned that a converter was already
  implemented but I cannot find it in svn.
 4. Write the documentation so that future users (and other dev members
 here) can test the component.
 
 These comments were general for both unsupervised and supervised WSD.
 Specific to supervised WSD:
 
  5. IMS: you mention in your previous email that the lexical sample
  part is done and that you need to finish the all-words IMS
  implementation. If this is the case, a JIRA issue should be opened about
  it and made a priority.
  Incidentally, I cannot find the IMSTester you mentioned in the email.
 
  There is an issue already there for the Evaluator (OPENNLP-790) but I
  think that each of the remaining tasks requires its own JIRA issue
  (that issue has pending unused imports, variables and other things).
 
  The aim before GSoC ends should be to have the best chance of having the
  WSD component as a good candidate for its integration in the opennlp
  tools. Also, by being able to test it, we can see the actual state of
  the component with respect to performance in the usual datasets.
 
 Can you please create such issues in JIRA and start addressing them 
 separately?
 
 Thanks,
 
 Rodrigo
 
 
 
 On Sun, Jun 28, 2015 at 6:33 PM, Mondher Bouazizi
 mondher.bouaz...@gmail.com wrote:
  Hi everyone,
 
   I finished the first iteration of the IMS approach for lexical sample
   disambiguation. Please find the patch uploaded to the JIRA issue [1]. I
   also created a tester (IMSTester) to run it.
 
   As I mentioned before, the approach is as follows: each time the module is
   called to disambiguate a word, it first checks whether the model file for
   that word exists.

   1- If the model file exists, it is used to disambiguate the word.

   2- Otherwise, if the model file does not exist, the module checks if the
   training data file for that word exists. If it does, the XML data will be
   used to train the model and create the model file.

   3- If no training data exists, the most frequent sense (MFS) in WordNet is
   returned.
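
   In pseudo-code, the fallback chain is roughly (illustrative names, not the
   actual patch API):

   if (modelFileExists(word)) {
       return classify(context, loadModel(word));          // case 1
   } else if (trainingDataExists(word)) {
       return classify(context, trainAndSaveModel(word));  // case 2
   } else {
       return mostFrequentSense(word);                     // case 3 (MFS)
   }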
 
   For now I am using the training data I collected from the Senseval and
   Semeval websites. However, I am currently checking SemCor to use it as a
   main reference.
 
  Yours sincerely,
 
  Mondher
 
  [1] https://issues.apache.org/jira/browse/OPENNLP-757
 
 
 
  On Thu, Jun 25, 2015 at 5:27 AM, Joern Kottmann kottm...@gmail.com wrote:
 
  On Fri, 2015-06-19 at 21:42 +0900, Mondher Bouazizi wrote:
   Hi,
  
   Actually I have finished the implementation of most of the parts of the
  IMS
   approach. I also made a parser for the Senseval-3 data.
  
   However I am currently working on two main points:
  
    - I am trying to figure out how to use the MaxEnt classifier.
    Unfortunately there is not enough documentation, so I am trying to see how
    it is used by the other components of OpenNLP. Any recommendations?
 
   Yes, have a look at the doccat component. It should be easy to
   understand from it how the classifier works. It has to be trained with
   events (outcome plus features) and can then classify a set of features
   into the categories it has seen before as outcomes.
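
   For example, the whole cycle boils down to something like this (a sketch,
   assuming the 1.6-style opennlp.tools.ml classes; the sense keys and
   features are made up):

   import java.util.Arrays;
   import opennlp.tools.ml.maxent.GIS;
   import opennlp.tools.ml.model.Event;
   import opennlp.tools.ml.model.MaxentModel;
   import opennlp.tools.util.CollectionObjectStream;
   import opennlp.tools.util.ObjectStream;

   // one Event per training instance: outcome = sense key, context = features
   ObjectStream<Event> events = new CollectionObjectStream<>(Arrays.asList(
       new Event("bank%1:14:00::", new String[]{"w-1=the", "w+1=to"}),
       new Event("bank%1:17:01::", new String[]{"w-1=river", "w+1=was"})));
   MaxentModel model = GIS.trainModel(events, 100, 0);
   double[] probs = model.eval(new String[]{"w-1=the", "w+1=to"});
   String best = model.getBestOutcome(probs);  // highest-scoring sense key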
 
  Jörn
 
  

RE: GSoC 2015 - WSD Module

2015-06-14 Thread Anthony Beylerian
Hi,
Concerning this point, I would like to ask about BabelNet [1]. The advantage
of [1] is that it integrates WordNet, Wikipedia, Wiktionary, OmegaWiki,
Wikidata, and the Open Multilingual WordNet.
Also, the newest SemEval task (whose results are just out [2]) relies on it.

However, the 2.5.1 version, which can be used locally, follows a CC BY-NC-SA
3.0 license [3]. I read in [4] that CC-A (Attribution) licenses are
acceptable; however, I am not completely sure whether the NC-SA
(NonCommercial/ShareAlike) terms would be prohibitive, since it was mentioned
that:

"Many of these licenses have specific attribution terms that need to be
adhered to, for example CC-A, often by adding them to the NOTICE file. Ensure
you are doing this when including these works. Note, this list is colloquially
known as the Category A list."

We would like your thoughts on the matter.
Thanks!
Anthony
[1] : http://babelnet.org/download
[2] : http://alt.qcri.org/semeval2015/cdrom/pdf/SemEval049.pdf
[3] : https://creativecommons.org/licenses/by-nc-sa/3.0/
[4] : http://www.apache.org/legal/resolved.html#category-a

 Date: Fri, 5 Jun 2015 15:09:24 +0200
 Subject: Re: GSoC 2015 - WSD Module
 From: kottm...@gmail.com
 To: dev@opennlp.apache.org
 
 Hello,
 
  yes, WordNet is fine, we already depend on it. I just think that remote
  resources are particularly problematic.
 
 For local resources it boils down to their license.
 
 Here is the wordnet one:
 http://wordnet.princeton.edu/wordnet/license/
 
 We might even be able to redistribute this here at Apache, which is really
 nice. To do that we have to check
 with the legal list if they give a green light for it.
 
 You can get more information about licenses and dependencies for Apache
 projects here:
 http://www.apache.org/legal/resolved.html#category-a
 http://www.apache.org/legal/resolved.html#category-b
 http://www.apache.org/legal/resolved.html#category-x
 
  Are the things you have to clean up of such a nature that you couldn't do
  them after you send in a patch?
  This could be removal of code which cannot be released under the ASL.
 
 We would like to get you integrated into the way we work here as quickly as
 possible.
 
 That includes:
 - Tasks are planned/tracked via jira (this allows other people to
 comment/follow)
 - We would like to be able to review your code and maybe give some advice
 (commit often, break things down in tasks)
  - Changes or new features are usually discussed on the dev list (e.g. a
  short write-up about the approaches you implemented
    or, better, plan to implement)
 
 Jörn

  

RE: GSoC 2015 - WSD Module

2015-06-10 Thread Anthony Beylerian
Hi,

I attached an initial patch to OPENNLP-758.
However, we are currently modifying things a bit, since many approaches need
to be supported, but we would like your recommendations.
Here are some notes:

1- We used extJWNL (see the snippet after this list).
2- [WSDisambiguator] is the main interface.
3- [Loader] loads the resources required.
4- Please check [FeaturesExtractor] for the methods mentioned by Rodrigo.
5- [Lesk] has many variants; we already implemented some, but we are wondering
about the preferred way to switch from one to the other. As of now we use one
of them as default, but we thought of either making a parameter list to fill
or making separate classes for each, or otherwise following your preference.
6- The other classes are for convenience.
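
As a quick illustration of point 1, this is how a sense definition can be
fetched from extJWNL, using the same call chain we mentioned before (a sketch;
the sense key is just an example):

import net.sf.extjwnl.data.Word;
import net.sf.extjwnl.dictionary.Dictionary;

// throws JWNLException if the default WordNet dictionary cannot be loaded
Dictionary dict = Dictionary.getDefaultResourceInstance();
Word word = dict.getWordBySenseKey("bank%1:14:00::");
String gloss = word.getSynset().getGloss();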

We will try to patch frequently on the separate issues, following the feedback.

Best regards,

Anthony

 Date: Wed, 10 Jun 2015 11:42:56 +0200
 Subject: Re: GSoC 2015 - WSD Module
 From: kottm...@gmail.com
 To: dev@opennlp.apache.org
 
  You can attach the patch to one of the issues, or you can create a new
  issue. In the end it doesn't matter much, but what is important is that we
  make progress here and get the initial code into our repository. Subsequent
  changes can then be done in a patch series.
 
 Please try to submit the patch as quickly as possible.
 
 Jörn
 
 On Mon, Jun 8, 2015 at 4:54 PM, Rodrigo Agerri rage...@apache.org wrote:
 
  Hello,
 
  On Mon, Jun 8, 2015 at 3:49 PM, Mondher Bouazizi
  mondher.bouaz...@gmail.com wrote:
   Dear Rodrigo,
  
    As Anthony mentioned in his previous email, I already started the
    implementation of the IMS approach. The pre-processing and the extraction
    of features have already been finished. Regarding the approach itself, it
    shows some potential according to the authors, though the features
    proposed are not that many, and are basic.
 
  Hi, yes, the features are not that complex, but it is good to have a
  working system and then if needed the feature set can be
  improved/enriched. As stated in the paper, the IMS approach leverages
  parallel data to obtain state of the art results in both lexical
  sample and all words for senseval 3 and semeval 2007 datasets.
 
  I think it will be nice to have a working system with this algorithm
  as part of the WSD component in OpenNLP (following the API discussion
  previous in this thread) and perform some evaluations to know where
  the system is with respect to state of the art results in those
  datasets. Once this is operative, I think it will be a good moment to
  start discussing additional/better features.
 
   I think the approach itself might be
   enhanced if we add more context specific features from some other
   approaches... (To do that, I need to run many experiments using different
   combinations of features, however, that should not be a problem).
 
  Speaking about the feature sets, in the API google doc I have not seen
  anything about the implementation of the feature extractors, could you
  perhaps provide some extra info (in that same document, for example)
  about that?
 
    But the approach itself requires a linear SVM classifier, and as far as I
    know, OpenNLP has only a Maximum Entropy classifier. Is it OK to use
    libsvm?
 
  I think you can try with a MaxEnt to start with and in the meantime,
  @Jörn has commented sometimes that there is a plugin component in
  OpenNLP to use third-party ML libraries and that he tested it with
  Mallet. Perhaps he could comment on this to use that functionality to
  use SVMs.
 
  
   Regarding the training data, I started collecting some from different
   sources. Most of the existing rich corpora are licensed (Including the
  ones
   mentioned in the paper). The free ones I got for now are from the
  Senseval
   and Semeval websites. However, these are used just to evaluate the
  proposed
   methods in the workshops. Therefore, the words to disambiguate are few in
   number though the training data for each word are rich enough.
  
   In any case, the first tests with Senseval and Semeval collected should
  be
   finished soon. However, I am not sure if there is a rich enough Dataset
  we
   can use to make our model for the WSD module in the OpenNLP library.
   If you have any recommendation, I would be grateful if you can help me on
   this point.
 
  Well, as I said in my previous email, research around word senses is
  moving from WSD towards Supersense tagging where there are recent
  papers and freely available tweet datasets, for example. In any case,
  we can look more into it but in the meantime the Semcor for training
  and senseval/semeval2007 datasets for evaluation should be enough to
  compare your system with the literature.
 
  
    As Jörn mentioned sending an initial patch, should we separate our code
    and upload two different patches to the two issues we created on the JIRA
    (however, this means a lot of redundancy in the code), or shall we keep
    them in one project and upload it? If we opt for the latter case, 

RE: GSoC 2015 - WSD Module

2015-06-03 Thread Anthony Beylerian
Dear Jörn,

Thank you for the reply.
===
Yes, in the draft WSDisambiguator is the main interface.
===
Yes, for the disambiguate method the input is expected to be tokenized; it
should be an input array.
The second argument is for the token index. We can also make it into an index
array to support multiple words.
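
For example (tentative signatures):

// one target word: senses ordered by score
String[] disambiguate(String[] tokenizedContext, int tokenIndex);
// several target words in the same context
String[][] disambiguate(String[] tokenizedContext, int[] tokenIndices);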
===
Concerning the resources, we expect two types: local and remote resources.

+ For local resources, we have two main types:
1- training models for supervised techniques
2- knowledge resources

It could be best to do the packaging using similar OpenNLP models for #1.
As for #2, it will depend on what we want to use, since the type of
information depends on the specific technique.

+ As for remote resources, e.g. [BabelNet], [WordsAPI], etc., we might need to
have some REST support, for example to retrieve a sense inventory for a
certain word. Actually, the newest SemEval task [Semeval15] will use
[BabelNet] for WSD and EL (Entity Linking). [BabelNet] has an offline version,
but the newest one is only available through REST. Also, in case a remote
resource is needed AND it typically requires a license, we need to use a
license key or just use the free quota with no key.

Therefore, we thought of having a [ResourceProvider] as mentioned in the
[draft].
Are there any plans to add an external API connector of the sort, or is this
functionality already possible for extension?
(I noticed there is a [wikinews_importer] in the sandbox.)

But in any case we can always start working only locally as a first step,
what do you think?
===
It would be more straightforward to use the algorithm names, so OK, why not.
===
Yes, we have already started working!
What do we need to push to the sandbox?
===

Thanks !

Anthony 

[BabelNet] : http://babelnet.org/download
[WordsAPI] : https://www.wordsapi.com/
[Semeval15] : http://alt.qcri.org/semeval2015/task13/
[draft] : 
https://docs.google.com/document/d/10FfAoavKQfQBAWF-frpfltcIPQg6GFrsoD1XmTuGsHM/edit?pli=1


 Subject: Re: GSoC 2015 - WSD Module
 From: kottm...@gmail.com
 To: dev@opennlp.apache.org
 Date: Mon, 1 Jun 2015 20:30:08 +0200
 
 Hello,
 
 I had a look at your APIs.
 
  Let's start with the WSDisambiguator. Should that be an interface?
 
  // returns the senses ordered by their score (best one first, or only 1 in
  // the supervised case)
  String[] disambiguate(String inputText, int inputWordposition);
 
 Shouldn't we have a tokenized input? Or is the inputText a token?
 
 If you have resources you could package those into OpenNLP models and
 use the existing serialization support. Would that work for you?
 
 I think we should have different implementing classes for different
 algorithms rather than grouping that in the Supervised and Unsupervised
 classes. And also use the algorithm / approach name as part of the class
 name.
 
  As far as I understand you already started to work on this. Should we do an
  initial code drop into the sandbox, and then work out things from there?
 We strongly prefer to have as much as possible source code editing
 history in our version control system.
 
 Jörn 
  

RE: GSoC 2015 - WSD Module

2015-05-18 Thread Anthony Beylerian
Please excuse the duplicate email, we could not attach the mentioned figure. 
Kindly find it here.
Thank you.

From: anthonybeyler...@hotmail.com
To: dev@opennlp.apache.org
Subject: GSoC 2015 - WSD Module
Date: Mon, 18 May 2015 22:14:43 +0900




Dear all,
In the context of building a Word Sense Disambiguation (WSD) module, after
doing a survey on WSD techniques, we realized the following points:

- WSD techniques can be split into three sets (supervised,
unsupervised/knowledge-based, hybrid).
- WSD is used for different directly related objectives such as all-words
disambiguation, lexical sample disambiguation, multi/cross-lingual
approaches, etc.
- Senseval/Semeval seem to be good references to compare different techniques
for WSD, since many of them were tested on the same data (but different data
each event).
- For the sake of making a first solution, we propose to start with
supporting the lexical sample type of disambiguation, meaning to disambiguate
single/limited word(s) from an input text.

Therefore, we have decided to collect information about the different
techniques in the literature (such as references, performance, parameters,
etc.) in this spreadsheet here. Otherwise, we have also collected the results
of all the Senseval/Semeval exercises here. (Note that each document has many
sheets.) The collected results could help decide which techniques to start
with as main models for each set of techniques (supervised/unsupervised).

We also propose a general approach for the package in the figure attached.
The main components are as follows:

1- The different resources publicly available: WordNet, BabelNet, Wikipedia,
etc. However, we would also like to allow the users to use their own local
resources, by maybe defining a type of connector to the resource interface.

2- The resource interface will have the role to provide both a sense
inventory that the user can query and a knowledge base (such as semantic or
syntactic information, etc.) that might be used depending on the technique.
We might even later consider building a local cache for remote services.

3- The WSD algorithms/techniques themselves, which will make use of the
resource interface to access the resources required. These techniques will be
split into two main packages as in the left side of the figure:
Supervised/Unsupervised. The utils package includes common tools used in both
types of techniques. The details mentioned in each package should be common
to all implementations of these abstract models.

4- I/O could be processed in different formats (XML/JSON, etc.) or a simpler
structure following your recommendations.

If you have any suggestions or recommendations, we would really appreciate
discussing them and would like your guidance to iterate on this tool-set.
Best regards,

Anthony Beylerian, Mondher Bouazizi 
  

GSoC - Self introduction

2015-05-03 Thread Anthony Beylerian
Dear all,
I am Anthony from Lebanon. As my colleague Mondher previously mentioned, we
have been accepted to work on OpenNLP as part of the GSoC 2015 program.
Similarly, I am also working on my Master's at Keio University and am fluent
in three natural languages (English, Arabic, French), with conversational
level Japanese as well. My development background is mainly in mobile and web
applications and services; however, I am currently also studying Data Mining.
As for the GSoC program, our task is to build a Word Sense Disambiguation
module, as well as test/example implementations of algorithms for that
purpose. We hope to build a flexible enough interface so that others can
easily test and use WSD tools, as well as extend them with their own
implementations. Officially, we will start coding from May 25th up to August
24th. However, we have been, and still are, surveying the different proposals
in the literature as well as the contemporary approaches more closely, and
will consequently propose a starting point for discussion.
If it is acceptable, we will later push updates to this list on our progress
and would really appreciate your input and feedback.
Thank you for your time.
Anthony Beylerian
 
 Dear all,
 
 I am Mondher Bouazizi, from Tunisia. I am a Master's student at Keio
 University in Japan. My academic research is currently focusing on Data
 Mining.
 
 I am glad to inform you that my project proposal has been accepted for the
 Google Summer of Code 2015. The proposal is to add a Word Sense
 Disambiguation (WSD) component to the OpenNLP library.
 
  The objective of WSD is to determine which sense of a word is meant in a
  particular context. Different techniques are proposed in the academic
  literature, but in general they fall mainly into two categories: supervised
  and unsupervised. In my work I will design and build a WSD module that
  implements the algorithms of common supervised techniques (e.g. Decision
  Trees, Exemplar-Based or Instance-Based Learning, etc.). On the other hand,
  my colleague Anthony, who was also accepted, will be working on the
  unsupervised ones.
 
 (For more details about the project, please check the issue I created here
 https://issues.apache.org/jira/browse/OPENNLP-757)
 
  I hope the work will make a good contribution to the OpenNLP project and to
  the open source community in general.
 
 Yours sincerely,
 
 Mondher Bouazizi
  

RE: Word Sense Disambiguation

2015-02-18 Thread Anthony Beylerian



Thank you for the feedback. I believe that having separate interfaces, as
mentioned, for sense provision and disambiguation would be a good idea.
We will try to survey the techniques and study the library further, to
propose a first structure when possible.
Best,

Anthony
 Subject: Re: Word Sense Disambiguation
 From: kottm...@gmail.com
 To: dev@opennlp.apache.org
 Date: Mon, 16 Feb 2015 16:48:48 +0100
 
 On Mon, 2015-02-16 at 16:29 +0100, Aliaksandr Autayeu wrote:
   Jörn, to avoid ambiguity in case you were addressing me to propose a WSD
   interface: I'd prefer Anthony to come up with a proposal, because he is
   closer to the multiple WSD algorithms that it would be nice to include in
   the analysis.
 
  Sorry for being unclear; yes, I addressed Anthony. But everybody who has
  an opinion is very welcome to join the discussion or propose something.
 
 Jörn
 

  

RE: Word Sense Disambiguation

2015-02-13 Thread Anthony Beylerian
Dear devs,

Please try some of the few (simpler) algorithms we implemented, to warm up to
the library:
http://131.113.41.202:8080/opennlp-wsd-demo/nlp-wsd-fe/app/#/home
(will make the source available after cleanup/housekeeping)

We need to define the structure and method signatures, since we will later
add more techniques incrementally.

Any suggestions/references for the structure and signatures are welcome.
Best,

Anthony

 Subject: Re: Word Sense Disambiguation
 From: kottm...@gmail.com
 To: dev@opennlp.apache.org
 Date: Mon, 19 Jan 2015 19:10:19 +0100
 
 Hello,
 
 +1 from me to just go ahead and implement the proposed approach. One
 goal of this implementation will be to figure out the interface we want
 to have in OpenNLP for WSD.
 
 We can later extend OpenNLP with more implementations which are taking
 different approaches.
 
 Jörn
 
 On Thu, 2015-01-15 at 16:50 +0900, Anthony Beylerian wrote:
  Hello, 
  
   I'm new here. I previously mentioned to Jörn that my colleagues and I are
   interested in helping to implement this component. We were thinking of
   starting with simple knowledge-based approaches; although they do not
   yield high accuracy, as a first step they are relatively simple. We would
   like your opinion.
  
   Pei also mentioned cTAKES
   (http://svn.apache.org/repos/asf/ctakes/sandbox/ctakes-wsd/, currently in
   very exploratory stages) and YTEX
   (https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08), which is
   also just exploring WSD for the healthcare domain. It is also currently
   knowledge/ontology based... It would be great to see OpenNLP support a
   general-domain WSD.
  
  Best, 
  
  Anthony

 
 
  

Word Sense Disambiguation

2015-01-14 Thread Anthony Beylerian
Hello, 

I'm new here. I previously mentioned to Jörn that my colleagues and I are
interested in helping to implement this component. We were thinking of
starting with simple knowledge-based approaches; although they do not yield
high accuracy, as a first step they are relatively simple. We would like
your opinion.

Pei also mentioned cTAKES
(http://svn.apache.org/repos/asf/ctakes/sandbox/ctakes-wsd/, currently in very
exploratory stages) and YTEX
(https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08), which is
also just exploring WSD for the healthcare domain. It is also currently
knowledge/ontology based... It would be great to see OpenNLP support a
general-domain WSD.

Best, 

Anthony