On 11/17/11 11:59 AM, Aliaksandr Autayeu wrote:
On Thu, Nov 17, 2011 at 11:48 AM, Jörn Kottmann <[email protected]> wrote:

On 11/17/11 11:32 AM, Aliaksandr Autayeu wrote:

We shouldn't replace JWNL with a newer version,
because we currently don't have the ability to train
or evaluate the coref component.

  +1. Having test coverage eases many things, refactoring and development
included :)

This is a big issue for us because it also blocks
other changes and updates to the code itself,
e.g. the cleanups Aliaksandr contributed.

What we need here is a plan for how we can get the coref component
into a state that makes it possible to develop it as a community.

If we don't find a way to resolve this, I think we should move the coref
stuff to the sandbox and leave it there until we have some training data.

  In my experience, doing things like this is almost equivalent to deleting
the piece of code altogether. On the other hand, if there is no developer
actively using and developing this piece, with corpora, tests, etc.,
others might not have enough incentive.

That is already the situation: the developer who wrote it doesn't support it
anymore.
The only way to get it alive again would be to get the training and
evaluation running. If we have that, it will be possible to continue
to work on it, and people can start using it. The code itself is easy
to understand, and I have a good idea of how it works.

In the current state it really blocks the development of a few things.


Another option would be to label enough Wikinews data, so we are able to
train it.

How much exactly is "enough"? And what's the annotation UI? This might
also be a good opportunity to improve the annotation tools. I might be
interested in pursuing this option (only if the corpus produced will be
under a free license), mainly to learn :) but I would need some help and
supervision.

We are discussing doing a Wikinews crowd-sourcing project to label
training data for all components in OpenNLP.

I once wrote a proposal to communicate this idea:
https://cwiki.apache.org/OPENNLP/opennlp-annotations.html

Currently we have a first version of the Corpus Server, plugins for
the UIMA Cas Editor (an annotation tool) to access articles in the
Corpus Server, and an OpenNLP plugin which can help with sentence
detection, tokenization and NER (and could be extended with support
for coref).
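
To make concrete what that plugin automates: pre-annotating an article
boils down to running the standard OpenNLP detectors in sequence. A
minimal sketch against the OpenNLP API; the model file names are
placeholders for whatever models you have trained:

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class PreAnnotate {
    public static void main(String[] args) throws Exception {
        // Model file names are placeholders.
        SentenceDetectorME sentDetector;
        try (InputStream in = new FileInputStream("en-sent.bin")) {
            sentDetector = new SentenceDetectorME(new SentenceModel(in));
        }
        TokenizerME tokenizer;
        try (InputStream in = new FileInputStream("en-token.bin")) {
            tokenizer = new TokenizerME(new TokenizerModel(in));
        }
        NameFinderME nameFinder;
        try (InputStream in = new FileInputStream("en-ner-person.bin")) {
            nameFinder = new NameFinderME(new TokenNameFinderModel(in));
        }

        String article = "Pierre Vinken joined the board. He is 61 years old.";

        // Detect sentences, tokenize each one, then find names in the tokens.
        for (String sentence : sentDetector.sentDetect(article)) {
            String[] tokens = tokenizer.tokenize(sentence);
            for (Span name : nameFinder.find(tokens)) {
                StringBuilder sb = new StringBuilder();
                for (int i = name.getStart(); i < name.getEnd(); i++) {
                    sb.append(tokens[i]).append(' ');
                }
                System.out.println(name.getType() + ": " + sb.toString().trim());
            }
        }

        // Clear document-level adaptive data between articles.
        nameFinder.clearAdaptiveData();
    }
}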

These tools are all located in the sandbox.

I am currently using them to run a private annotation project, and
therefore have time to work on them.

I'll take a look at them. I also have my own annotation tools, because I
wasn't happy with what was available out there a few years ago and because
of some specifics of the situation which can be exploited to speed up the
annotation, but I would be happy to avoid duplication.



Are your own tools also Open Source? The Cas Editor itself is often criticized for not fitting the needs of a particular annotation project, but it can easily be extended by a plugin which adds a new Eclipse view to show just the information you need.
I did this a lot for a few very specific things.
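
Such a view is just a ViewPart subclass registered via the
org.eclipse.ui.views extension point. A minimal skeleton; the class
name and label are made up for illustration:

import org.eclipse.swt.SWT;
import org.eclipse.swt.widgets.Composite;
import org.eclipse.swt.widgets.Label;
import org.eclipse.ui.part.ViewPart;

// Hypothetical view showing coref-specific information next to the Cas Editor.
public class CorefChainView extends ViewPart {

    @Override
    public void createPartControl(Composite parent) {
        // A real view would render e.g. the coreference chains of the
        // document open in the Cas Editor instead of a static label.
        new Label(parent, SWT.NONE).setText("Coref chains go here");
    }

    @Override
    public void setFocus() {
        // Nothing to focus in this sketch.
    }
}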

I think UIMA is a great platform for annotation tooling, since the UIMA CAS (a data structure which can contain text and annotations) gives you many of the features you need to build such a tool and is easy to adapt to new use cases, e.g. by defining a new feature structure
type.
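
A small sketch of what that looks like with the plain UIMA API; the type
name "example.Mention" is invented for the example:

import org.apache.uima.UIMAFramework;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.text.AnnotationFS;
import org.apache.uima.resource.metadata.TypeSystemDescription;
import org.apache.uima.util.CasCreationUtils;

public class CasExample {
    public static void main(String[] args) throws Exception {
        // Declare a new annotation type programmatically.
        TypeSystemDescription tsd =
                UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();
        tsd.addType("example.Mention", "A coref mention", "uima.tcas.Annotation");

        CAS cas = CasCreationUtils.createCas(tsd, null, null);
        cas.setDocumentText("Pierre Vinken joined the board.");

        // Annotate "Pierre Vinken" with the new type and index it.
        Type mentionType = cas.getTypeSystem().getType("example.Mention");
        AnnotationFS mention = cas.createAnnotation(mentionType, 0, 13);
        cas.addFsToIndexes(mention);

        System.out.println(mention.getCoveredText());
    }
}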

OpenNLP already has training support for UIMA, which I use to train new models; these are then placed on an HTTP server, and the OpenNLP Cas Editor plugin can load the models via HTTP. With this setup you have a closed learning loop, and training can be done every few minutes.
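
The HTTP part needs nothing special, since OpenNLP models are plain
serialized files and the model classes accept any InputStream. Roughly
(the URL is a placeholder for wherever the freshly trained models are
published):

import java.io.InputStream;
import java.net.URL;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;

public class RemoteModelLoader {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; point it at the server the training run publishes to.
        URL modelUrl = new URL("http://localhost:8080/models/en-ner-person.bin");
        try (InputStream in = modelUrl.openStream()) {
            NameFinderME nameFinder = new NameFinderME(new TokenNameFinderModel(in));
            System.out.println("Loaded model: " + nameFinder);
        }
    }
}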

Back to the coref component: I had a look at extjwnl. One of the issues I noticed with WordNet is that there are so many different versions and formats for different languages, which makes it hard to integrate them into coref (which should one day be able to support other languages as well). I always thought we might need to define our own WordNet data format, so we can easily handle
WordNets for different languages.
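
For reference, the kind of lookup coref needs would look roughly like
this with extjwnl. I'm going from memory of the API, so treat the exact
calls as assumptions; it also assumes a WordNet data artifact or
properties file is available:

import net.sf.extjwnl.data.IndexWord;
import net.sf.extjwnl.data.POS;
import net.sf.extjwnl.data.Synset;
import net.sf.extjwnl.dictionary.Dictionary;

public class ExtJwnlLookup {
    public static void main(String[] args) throws Exception {
        // Assumes the extjwnl data artifact is on the classpath.
        Dictionary dictionary = Dictionary.getDefaultResourceInstance();

        // Look up the senses of a head word, e.g. to check its semantic class.
        IndexWord word = dictionary.getIndexWord(POS.NOUN, "lawyer");
        if (word != null) {
            for (Synset sense : word.getSenses()) {
                System.out.println(sense.getGloss());
            }
        }
    }
}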

I saw that you worked on this library; maybe that could be something we can move to OpenNLP or base some new work on.

Another issue is that we have a zip package which contains all the resources a component loads, but it looks like this is not so easy with the current WordNet directory layout.
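
One possible workaround, sketched here under the assumption that writing
to a temp directory is acceptable, is to extract the WordNet files from
the model package so a file-based dictionary can open them:

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class WordNetZipExtractor {
    // Extracts the WordNet files bundled in a zip package to a temp
    // directory and returns that directory, so a file-based dictionary
    // implementation can be pointed at it.
    public static File extract(File wordnetZip) throws Exception {
        File dictDir = Files.createTempDirectory("wordnet").toFile();
        try (ZipFile zip = new ZipFile(wordnetZip)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                File out = new File(dictDir, entry.getName());
                if (entry.isDirectory()) {
                    out.mkdirs();
                    continue;
                }
                out.getParentFile().mkdirs();
                try (InputStream in = zip.getInputStream(entry);
                     FileOutputStream os = new FileOutputStream(out)) {
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        os.write(buf, 0, n);
                    }
                }
            }
        }
        return dictDir;
    }
}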

Jörn
