On 6/10/11 4:12 PM, Olivier Grisel wrote:
Hi all,
Here is a short report of the Berlin Buzzwords Semantic / NLP
Hackathon that happened on Wednesday and yesterday at Neofonie and was
related to this corpus annotation project.
Basically we worked in small groups of 2-3 people on various related topics.
Hannes introduced an HTML / JS based tool named Walter to visualize
and edit named entities and, optionally, typed relations between
those entities. A demo is here:
http://tmdemo.iais.fraunhofer.de/walter/
Currently Walter works with UIMA / XMI formatted files as input /
output, using a Java servlet deployed on a Tomcat server, for instance.
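For reference, round-tripping those XMI files is straightforward with
the standard UIMA APIs. A minimal sketch (the descriptor and file
names here are made up):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import org.apache.uima.UIMAFramework;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.impl.XmiCasDeserializer;
    import org.apache.uima.cas.impl.XmiCasSerializer;
    import org.apache.uima.resource.metadata.TypeSystemDescription;
    import org.apache.uima.util.CasCreationUtils;
    import org.apache.uima.util.XMLInputSource;

    public class XmiRoundTrip {
        public static void main(String[] args) throws Exception {
            // Type system descriptor matching the annotations in the XMI file
            TypeSystemDescription tsd = UIMAFramework.getXMLParser()
                .parseTypeSystemDescription(new XMLInputSource("TypeSystem.xml"));
            CAS cas = CasCreationUtils.createCas(tsd, null, null);
            // Load an annotated article as produced / consumed by Walter
            XmiCasDeserializer.deserialize(new FileInputStream("article.xmi"), cas);
            // ... visualize or edit the named entity annotations here ...
            // Write the (possibly edited) CAS back out
            XmiCasSerializer.serialize(cas, new FileOutputStream("article.xmi"));
        }
    }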
The plan is to adapt Walter to a corpus annotation validation /
refinement pattern: feed it with a partially annotated corpus coming
from the output of an OpenNLP model pre-trained on the annotations
extracted from Wikipedia using https://github.com/ogrisel/pignlproc
to bootstrap multilingual models.
I was actually thinking about something similar: make a small server
which can host XMI CAS files. CASes have the advantage that they take
away lots of the complexity of dealing with a text and its annotations.
Since we have a UIMA integration, OpenNLP can be trained directly with
the CASes. In this case we would make a small server component which
can do the training and then make the models available via HTTP, for
example.
It sounds like a corpus-refiner web UI could easily be attached to
such a server, as could other tools like the Cas Editor.
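To make the training step concrete, here is a minimal sketch under
the assumption that the CASes have already been converted to OpenNLP's
NameSample training format (all file names are made up):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStreamReader;
    import java.util.Collections;
    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class TrainingService {
        public static void main(String[] args) throws Exception {
            // One sample per line, with <START> ... <END> name markup
            ObjectStream<NameSample> samples = new NameSampleDataStream(
                new PlainTextByLineStream(new InputStreamReader(
                    new FileInputStream("corpus.train"), "UTF-8")));
            TokenNameFinderModel model = NameFinderME.train(
                "en", "person", samples, Collections.<String, Object> emptyMap());
            // The server component could then publish this file via HTTP
            model.serialize(new FileOutputStream("en-ner-person.bin"));
        }
    }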
To pre-annotate the articles, we might want to add different types of
name annotations.
We would like to make a fast binary (validate / discard) interface
with keyboard shortcuts that focuses on one sentence at a time. If
the user thinks that all the entities in the sentence are correctly
annotated by the model, he/she presses "space", the sentence is
marked as validated, and the focus moves to the next sentence. If the
sentence is complete gibberish, he/she can discard the sample by
pressing "d". The user can also fix individual annotations using the
mouse interface before validating the corrected sample.
Did you discuss focusing on the sentence level? This solution would
still require one annotator to go through the entire document. Maybe
we have a user who wants to fix our wikinews model to detect his
entity of choice. Then he might want to search for sentences which
contain it and label only these.
Working on the sentence level also has the advantage that a user can
skip a sentence which contains an entity he is not sure how to label.
The up and down arrows allow the user to move the focus to the
previous and next sentences (infinite AJAX / JSON scrolling over the
corpus) without validating / discarding the current sample.
When the focus is on a sample, the previous and next samples should
be displayed before and after it with a lower opacity level and in
read-only mode, so as to provide the user with contextual information
to make the right decision on the active sample.
At the end of the session, the user can export all the validated
samples as a new corpus in the OpenNLP training format. Unprocessed
or explicitly discarded samples are not part of this refined version
of the annotated corpus.
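The export itself is cheap since NameSample.toString() already emits
the OpenNLP training markup; roughly:

    import java.io.FileWriter;
    import java.io.Writer;
    import java.util.List;
    import opennlp.tools.namefind.NameSample;

    public class CorpusExporter {
        // Writes only the validated samples, one per line, in the format
        // expected by the OpenNLP name finder trainer; unprocessed and
        // discarded samples are simply never added to the list.
        public static void export(List<NameSample> validated, String path)
                throws Exception {
            Writer out = new FileWriter(path);
            for (NameSample sample : validated) {
                out.write(sample.toString()); // <START:type> ... <END> markup
                out.write("\n");
            }
            out.close();
        }
    }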
To implement this we plan to rewrite the server side part of Walter in
two parts:
1- a set of JAX-RS resources to convert corpus items + their
annotations between JSON objects on the client and OpenNLP NameSamples
on the server (a rough sketch of such a resource follows the list
below). A first embryo of this part is here:
https://github.com/ogrisel/bbuzz-semantic-hackathon/tree/master/corpus-refiner-web
2- a POJO lib that uses OpenNLP to handle corpus loading, iterative
validation (with validation / discarding / updates + previous and
next navigation) and serialization of the validated samples to a new
OpenNLP formatted file that can be fed to train a new generation of
the model. The work on this part has started here:
https://github.com/ogrisel/bbuzz-semantic-hackathon/tree/master/corpus-refiner
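For part 1, a resource could look roughly like the following sketch;
the paths, the SampleDto fields and the endpoints are made-up
placeholders, not the actual prototype API:

    import javax.ws.rs.Consumes;
    import javax.ws.rs.GET;
    import javax.ws.rs.POST;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.MediaType;

    @Path("/samples")
    public class SampleResource {

        // Minimal JSON view of a sample; the field names are invented here
        public static class SampleDto {
            public String[] tokens;
            public int[][] nameSpans; // e.g. [start, end, typeId] triples
        }

        @GET @Path("{i}") @Produces(MediaType.APPLICATION_JSON)
        public SampleDto get(@PathParam("i") int i) {
            return null; // TODO: fetch sample i from the refiner lib (part 2)
        }

        @POST @Path("{i}/validate") @Consumes(MediaType.APPLICATION_JSON)
        public void validate(@PathParam("i") int i, SampleDto corrected) {
            // TODO: convert back to a NameSample and mark it validated ("space")
        }

        @POST @Path("{i}/discard")
        public void discard(@PathParam("i") int i) {
            // TODO: mark sample i as discarded ("d")
        }
    }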
Did you think of using GWT? It might be a very good fit for OpenNLP
because everyone here has a lot of experience with Java, but maybe
not so much experience with JS.
Have a look at the test folder to see what's currently implemented. I
would like to keep this in a separate maven artifact to be able to
build a simple alternative CLI variant of the refiner interface that
does not require starting a jetty or tomcat instance / a browser.
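Such a CLI variant could be little more than the following sketch (a
plain terminal cannot catch a bare "space" keypress, so here enter
validates and "d" + enter discards; the corpus path comes from the
command line):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class CliRefiner {
        public static void main(String[] args) throws Exception {
            NameSampleDataStream samples = new NameSampleDataStream(
                new PlainTextByLineStream(new InputStreamReader(
                    new FileInputStream(args[0]), "UTF-8")));
            BufferedReader keyboard =
                new BufferedReader(new InputStreamReader(System.in));
            List<NameSample> validated = new ArrayList<NameSample>();
            NameSample sample;
            while ((sample = samples.read()) != null) {
                System.out.println(sample); // shows the <START> ... <END> markup
                System.out.print("[enter]=validate, d=discard > ");
                if (!"d".equals(keyboard.readLine())) {
                    validated.add(sample);
                }
            }
            // ... export the validated list as in the sketch above ...
        }
    }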
For the client side, Hannes started checking that jQuery should make
it easier to implement the AJAX callbacks based on mouse + keyboard
interaction.
As for the licensing, Hannes told me that his employer should be
willing to license the relevant parts of Walter (those not specific
to Fraunhofer) under a liberal license (MIT, BSD or ASL) so that it
should be possible to contribute it to the ASF in the long term.
Another group tested DUALIST: the tool looks really nice for the text
classification case, less so for the NE detection case (the sample
view is not very well suited for structured output, and it requires
building Hearst features by hand; DUALIST apparently does not do this
automatically).
It should be possible to turn the Walter refiner into a real active
learning annotation tool for structured output (NE and relation
extraction) if we use the confidence level of OpenNLP's
SequentialPerceptron and treat the least confident predictions as
priority samples when ordering the samples to process after each
"space" or "d". The server could incrementally use the refined
samples to update its model and adjust the priority of the next batch
of samples to refine from time to time, since the perceptron
algorithm is online (it supports partial updates of the model without
restarting from scratch).
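For illustration, a rough sketch of that ranking step, assuming we
score each sentence by the minimum span probability reported by
NameFinderME (sentences without any detected name keep the default
score of 1.0 and therefore end up last):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.IdentityHashMap;
    import java.util.List;
    import java.util.Map;
    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.util.Span;

    public class ConfidenceRanker {
        // Orders sentences so the ones the model is least sure about come first
        public static List<String[]> rank(NameFinderME finder,
                List<String[]> sentences) {
            // IdentityHashMap because String[] has no value-based equals()
            final Map<String[], Double> conf =
                new IdentityHashMap<String[], Double>();
            for (String[] tokens : sentences) {
                Span[] names = finder.find(tokens);
                double min = 1.0;
                for (double p : finder.probs(names)) {
                    min = Math.min(min, p);
                }
                conf.put(tokens, min);
            }
            List<String[]> ranked = new ArrayList<String[]>(sentences);
            Collections.sort(ranked, new Comparator<String[]>() {
                public int compare(String[] a, String[] b) {
                    return Double.compare(conf.get(a), conf.get(b));
                }
            });
            return ranked;
        }
    }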
Another group worked on named entity disambiguation using the Solr
MoreLikeThisHandler and indexes of the contexts in which those
entities occur in Wikipedia articles. This work will probably be
integrated into Stanbol directly and should be less interesting for
the OpenNLP project. Yet another group worked on adapting pignlproc
to their own tools and Hadoop infrastructure.
Entity disambiguation would be very nice to have in OpenNLP and I also
need to work on that soon.
Comments and pull-requests on the corpus-refiner prototype are
welcome. I plan to keep working on this project from time to time.
AFAIK Hannes won't have time to work on the JS layer in the short
term, but it should at least be possible to have a first version of
the command line based interface rather quickly.
Yes, it would be nice to have such a tool, but for OpenNLP Annotations
it must be more focused on crowdsourcing and work well with a small /
medium sized group of people.
And of course we need to extend it over time to support other kinds
of annotation tasks.
Jörn