Nice! It seems to me Walter/corpus-refiner could be useful with regard to
OpenNLP Annotations [1]. Thanks for the report, Olivier :)

Tommaso
[1] : https://cwiki.apache.org/OPENNLP/opennlp-annotations.html

2011/6/10 Olivier Grisel <[email protected]>
> Hi all,
>
> Here is a short report of the Berlin Buzzwords Semantic / NLP Hackathon
> that happened on Wednesday and yesterday at Neofonie and was related to
> this corpus annotation project.
>
> Basically we worked in small 2-3 people groups on various related topics.
>
> Hannes introduced an HTML / JS based tool named Walter to visualize and
> edit named entities and (optionally) typed relations between those
> entities. Demo is here:
>
>   http://tmdemo.iais.fraunhofer.de/walter/
>
> Currently Walter works with UIMA / XMI formatted files as input / output,
> using a java servlet deployed on a tomcat server for instance. The plan
> is to adapt it to a corpus annotation validation / refinement pattern:
> feed it with a partially annotated corpus coming from the output of an
> OpenNLP model pre-trained on the annotations extracted from Wikipedia
> using https://github.com/ogrisel/pignlproc to bootstrap multilingual
> models.
>
> We would like to make a fast binary interface with keyboard shortcuts to
> focus on one sentence at a time. If the user thinks that all the entities
> in the sentence are correctly annotated by the model, he/she presses
> "space", the sentence is marked as validated and the focus moves to the
> next sentence. If the sentence is complete gibberish he/she can discard
> the sample by pressing "d". The user can also fix individual annotations
> using the mouse interface before validating the corrected sample.
>
> Up arrow and down arrow allow the user to move the focus to the previous
> and next sentences (infinite AJAX / JSON scrolling over the corpus)
> without validating / discarding them.
>
> When the focus is on a sample, the previous and next samples should be
> displayed before and after it with a lower opacity level in read-only
> mode, so as to provide the user with contextual information to make the
> right decision on the active sample.
>
> At the end of the session, the user can export all the validated samples
> as a new corpus formatted using the OpenNLP format. Unprocessed or
> explicitly discarded samples are not part of this refined version of the
> annotated corpus.
>
> To implement this we plan to rewrite the server side part of Walter in
> two parts:
>
> 1- a set of JAX-RS resources to convert corpus items + their annotations
> as JSON objects on the client to / from OpenNLP NameSamples on the
> server. The first embryo for this part is here:
>
>   https://github.com/ogrisel/bbuzz-semantic-hackathon/tree/master/corpus-refiner-web
>
> 2- a POJO lib that uses OpenNLP to handle corpus loading, iterative
> validation (with validation / discarding / update + previous and next
> navigation) and serialization of the validated samples to a new OpenNLP
> formatted file that can be fed to train a new generation of the model.
> The work on this part has started here:
>
>   https://github.com/ogrisel/bbuzz-semantic-hackathon/tree/master/corpus-refiner
>
> Have a look at the test folder to see what's currently implemented. I
> would like to keep this in a separate maven artifact to be able to build
> a simple alternative CLI variant of the refiner interface that does not
> require starting a jetty or tomcat instance / browser (a rough sketch of
> such a loop follows below).
>
> For the client side, Hannes started to check that jQuery should make it
> easier to implement the ajax callbacks based on mouse + keyboard
> interaction.
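>
> To give a rough idea of the CLI variant mentioned above, here is an
> untested sketch on top of the OpenNLP NameSample API (class name and
> files are just placeholders, the actual corpus-refiner code will differ;
> in this plain console version "enter" plays the role of "space"):
>
>   import java.io.BufferedReader;
>   import java.io.FileReader;
>   import java.io.FileWriter;
>   import java.io.PrintWriter;
>   import java.util.Scanner;
>
>   import opennlp.tools.namefind.NameSample;
>
>   /** Minimal keep / discard loop over a pre-annotated corpus. */
>   public class CliRefiner {
>
>       public static void main(String[] args) throws Exception {
>           BufferedReader in = new BufferedReader(new FileReader(args[0]));
>           PrintWriter out = new PrintWriter(new FileWriter(args[1]));
>           Scanner keyboard = new Scanner(System.in);
>
>           String line;
>           while ((line = in.readLine()) != null) {
>               if (line.trim().isEmpty()) {
>                   continue;
>               }
>               // one sentence per line in the OpenNLP name finder training format
>               NameSample sample = NameSample.parse(line, false);
>               System.out.println(sample.toString());
>               System.out.print("[enter = keep, d = discard] ");
>               String key = keyboard.nextLine().trim();
>               if (!"d".equals(key)) {
>                   // only validated samples end up in the refined corpus
>                   out.println(sample.toString());
>               }
>           }
>           out.close();
>           in.close();
>       }
>   }
>
> The real thing would also need to support fixing individual annotations
> and navigating back to previous samples; this only shows the validate /
> discard flow.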
>
> As for the licensing, Hannes told me that his employer should be willing
> to license the relevant parts of Walter (not specific to Fraunhofer)
> under a liberal license (MIT, BSD or ASL), so that it should be possible
> to contribute it to the ASF in the long term.
>
> Another group tested DUALIST: the tool looks really nice for the text
> classification case, less so for the NE detection case (the sample view
> is not very well suited for structured output and it requires building
> Hearst features by hand; DUALIST apparently does not do it
> automatically).
>
> It should be possible to turn the Walter refiner into a real active
> learning annotation tool for structured output (NE and relation
> extraction) if we use the confidence level of the SequentialPerceptron
> of OpenNLP and treat the least confident predictions as priority samples
> when ordering the samples to process with the refiner after pressing
> "space" or "d". The server could incrementally use the refined samples
> to update its model and adjust the priority of the next batch of samples
> to refine from time to time, as the perceptron algorithm is online (it
> supports partial updates of the model without restarting from scratch).
>
> Another group worked on named entity disambiguation using the Solr
> MoreLikeThisHandler and indexes of context occurrences of those entities
> in Wikipedia articles. This work will probably be integrated in Stanbol
> directly and should be less interesting for the OpenNLP project. Also,
> another group worked on adapting pignlproc to their own tools and hadoop
> infrastructure.
>
> Comments and pull-requests on the corpus-refiner prototype are welcome.
> I plan to go on working on this project from time to time. AFAIK Hannes
> won't have time to work on the JS layer in the short term, but it should
> at least be possible to have a first version of the command line based
> interface rather quickly.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
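P.S.: just to illustrate the confidence-based ordering you describe, something
along these lines might work on top of NameFinderME (untested sketch; I have
not checked how probs(Span[]) behaves for a perceptron-trained model, so take
the scoring with a grain of salt):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.util.Span;

    /** Sort tokenized sentences so the least confident predictions come first. */
    public class ConfidenceOrdering {

        private static class Scored {
            final String[] tokens;
            final double confidence;

            Scored(String[] tokens, double confidence) {
                this.tokens = tokens;
                this.confidence = confidence;
            }
        }

        public static List<String[]> leastConfidentFirst(NameFinderME finder,
                                                         List<String[]> sentences) {
            List<Scored> scored = new ArrayList<Scored>();
            for (String[] tokens : sentences) {
                Span[] names = finder.find(tokens);
                // use the weakest span probability as the sentence confidence;
                // sentences with no detected name default to 1.0 (nothing to fix)
                double confidence = 1.0;
                for (double p : finder.probs(names)) {
                    confidence = Math.min(confidence, p);
                }
                scored.add(new Scored(tokens, confidence));
            }
            Collections.sort(scored, new Comparator<Scored>() {
                public int compare(Scored a, Scored b) {
                    return Double.compare(a.confidence, b.confidence);
                }
            });
            List<String[]> ordered = new ArrayList<String[]>();
            for (Scored s : scored) {
                ordered.add(s.tokens);
            }
            return ordered;
        }
    }

Re-scoring the not-yet-refined samples with the updated model every few
validations would then give the online / active learning behaviour you mention.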
