Nice! It seems to me Walter/corpus-refiner could be useful with regard to
OpenNLP Annotations [1]. Thanks for the report, Olivier :)

Tommaso
[1] : https://cwiki.apache.org/OPENNLP/opennlp-annotations.html

2011/6/10 Olivier Grisel <[email protected]>
> Hi all,
>
> Here is a short report of the Berlin Buzzwords Semantic / NLP Hackathon
> that happened on Wednesday and yesterday at Neofonie and was related to
> this corpus annotation project.
>
> Basically we worked in small 2-3 people groups on various related topics.
>
> Hannes introduced an HTML / JS based tool named Walter to visualize and
> edit named entities and (optionally) typed relations between those
> entities. Demo is here:
>
>   http://tmdemo.iais.fraunhofer.de/walter/
>
> Currently Walter works with UIMA / XMI formatted files as input / output,
> using a java servlet deployed on a tomcat server for instance. The plan
> is to adapt it to a corpus annotation validation / refinement pattern:
> feed it with a partially annotated corpus coming from the output of an
> OpenNLP model pre-trained on the annotations extracted from Wikipedia
> using https://github.com/ogrisel/pignlproc to bootstrap multilingual
> models.
>
> We would like to make a fast binary interface with keyboard shortcuts to
> focus on one sentence at a time. If the user thinks that all the entities
> in the sentence are correctly annotated by the model, he/she presses
> "space", the sentence is marked as validated and the focus moves to the
> next sentence. If the sentence is complete gibberish he/she can discard
> the sample by pressing "d". The user can also fix individual annotations
> using the mouse interface before validating the corrected sample.
>
> Up arrow and down arrow allow the user to move the focus to the previous
> and next sentences (infinite AJAX / JSON scrolling over the corpus)
> without validating / discarding them.
>
> When the focus is on a sample, the previous and next samples should be
> displayed before and after it with a lower opacity level in read-only
> mode, so as to provide the user with contextual information to make the
> right decision on the active sample.
>
> At the end of the session, the user can export all the validated samples
> as a new corpus formatted using the OpenNLP format. Unprocessed or
> explicitly discarded samples are not part of this refined version of the
> annotated corpus.
>
> To implement this we plan to rewrite the server side part of Walter in
> two parts:
>
> 1- a set of JAX-RS resources to convert corpus items + their annotations
> as JSON objects on the client to / from OpenNLP NameSamples on the
> server. The first embryo for this part is here:
>
>   https://github.com/ogrisel/bbuzz-semantic-hackathon/tree/master/corpus-refiner-web
>
> 2- a POJO lib that uses OpenNLP to handle corpus loading, iterative
> validation (with validation / discarding / update + previous and next
> navigation) and serialization of the validated samples to a new OpenNLP
> formatted file that can be fed to train a new generation of the model.
> The work on this part has started here:
>
>   https://github.com/ogrisel/bbuzz-semantic-hackathon/tree/master/corpus-refiner
>
> Have a look at the test folder to see what's currently implemented. I
> would like to keep this in a separate maven artifact to be able to build
> a simple alternative CLI variant of the refiner interface that does not
> require starting a jetty or tomcat instance / browser (a rough sketch of
> such a loop follows below).
>
> For the client side, Hannes started to check that jQuery should make it
> easier to implement the ajax callbacks based on mouse + keyboard
> interaction.
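>
> To give a rough idea of the CLI variant mentioned above, here is an
> untested sketch on top of the OpenNLP NameSample API (class name and
> files are just placeholders, the actual corpus-refiner code will differ;
> in this plain console version "enter" plays the role of "space"):
>
>   import java.io.BufferedReader;
>   import java.io.FileReader;
>   import java.io.FileWriter;
>   import java.io.PrintWriter;
>   import java.util.Scanner;
>
>   import opennlp.tools.namefind.NameSample;
>
>   /** Minimal keep / discard loop over a pre-annotated corpus. */
>   public class CliRefiner {
>
>       public static void main(String[] args) throws Exception {
>           BufferedReader in = new BufferedReader(new FileReader(args[0]));
>           PrintWriter out = new PrintWriter(new FileWriter(args[1]));
>           Scanner keyboard = new Scanner(System.in);
>
>           String line;
>           while ((line = in.readLine()) != null) {
>               if (line.trim().isEmpty()) {
>                   continue;
>               }
>               // one sentence per line in the OpenNLP name finder training format
>               NameSample sample = NameSample.parse(line, false);
>               System.out.println(sample.toString());
>               System.out.print("[enter = keep, d = discard] ");
>               String key = keyboard.nextLine().trim();
>               if (!"d".equals(key)) {
>                   // only validated samples end up in the refined corpus
>                   out.println(sample.toString());
>               }
>           }
>           out.close();
>           in.close();
>       }
>   }
>
> The real thing would also need to support fixing individual annotations
> and navigating back to previous samples; this only shows the validate /
> discard flow.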
>
> As for the licensing, Hannes told me that his employer should be willing
> to license the relevant parts of Walter (not specific to Fraunhofer)
> under a liberal license (MIT, BSD or ASL), so that it should be possible
> to contribute it to the ASF in the long term.
>
> Another group tested DUALIST: the tool looks really nice for the text
> classification case, less so for the NE detection case (the sample view
> is not very well suited for structured output and it requires building
> Hearst features by hand; DUALIST apparently does not do it
> automatically).
>
> It should be possible to turn the Walter refiner into a real active
> learning annotation tool for structured output (NE and relation
> extraction) if we use the confidence level of the SequentialPerceptron
> of OpenNLP and treat the least confident predictions as priority samples
> when ordering the samples to process with the refiner after pressing
> "space" or "d". The server could incrementally use the refined samples
> to update its model and adjust the priority of the next batch of samples
> to refine from time to time, as the perceptron algorithm is online (it
> supports partial updates of the model without restarting from scratch).
>
> Another group worked on named entity disambiguation using the Solr
> MoreLikeThisHandler and indexes of context occurrences of those entities
> in Wikipedia articles. This work will probably be integrated in Stanbol
> directly and should be less interesting for the OpenNLP project. Also,
> another group worked on adapting pignlproc to their own tools and hadoop
> infrastructure.
>
> Comments and pull-requests on the corpus-refiner prototype are welcome.
> I plan to go on working on this project from time to time. AFAIK Hannes
> won't have time to work on the JS layer in the short term, but it should
> at least be possible to have a first version of the command line based
> interface rather quickly.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
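P.S.: just to illustrate the confidence-based ordering you describe, something
along these lines might work on top of NameFinderME (untested sketch; I have
not checked how probs(Span[]) behaves for a perceptron-trained model, so take
the scoring with a grain of salt):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.util.Span;

    /** Sort tokenized sentences so the least confident predictions come first. */
    public class ConfidenceOrdering {

        private static class Scored {
            final String[] tokens;
            final double confidence;

            Scored(String[] tokens, double confidence) {
                this.tokens = tokens;
                this.confidence = confidence;
            }
        }

        public static List<String[]> leastConfidentFirst(NameFinderME finder,
                                                         List<String[]> sentences) {
            List<Scored> scored = new ArrayList<Scored>();
            for (String[] tokens : sentences) {
                Span[] names = finder.find(tokens);
                // use the weakest span probability as the sentence confidence;
                // sentences with no detected name default to 1.0 (nothing to fix)
                double confidence = 1.0;
                for (double p : finder.probs(names)) {
                    confidence = Math.min(confidence, p);
                }
                scored.add(new Scored(tokens, confidence));
            }
            Collections.sort(scored, new Comparator<Scored>() {
                public int compare(Scored a, Scored b) {
                    return Double.compare(a.confidence, b.confidence);
                }
            });
            List<String[]> ordered = new ArrayList<String[]>();
            for (Scored s : scored) {
                ordered.add(s.tokens);
            }
            return ordered;
        }
    }

Re-scoring the not-yet-refined samples with the updated model every few
validations would then give the online / active learning behaviour you mention.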
