2011/6/22 Jörn Kottmann <[email protected]>:
> I was actually thinking about something similar. Make a small server which
> can host XMI CAS files. CASes have the advantage that they take away lots
> of complexity when dealing with a text and annotations.
>
> Since we have a UIMA integration, OpenNLP can directly be trained with the
> CASes; in this case we would make a small server component which can do
> the training and then make the models available via HTTP, for example.
>
> It sounds like a corpus-refiner-based web UI could easily be attached
> to such a server, and also other tools like the Cas Editor.

I find the UIMA CAS API much more complicated to work with than working
directly with token-level concepts through the OpenNLP API (i.e. with
arrays of Span). I haven't had a look at the opennlp-uima subproject
though: you probably already have tooling and predefined type systems
that make interoperability with CAS instances less of a pain.
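For comparison, here is roughly what I mean by token-level processing with
the plain OpenNLP API (just a minimal sketch: the class name, model path and
sample sentence are made up, and it assumes a pre-trained name finder model
is available on disk):

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;

    public class SpanExample {
        public static void main(String[] args) throws Exception {
            // Load a pre-trained name finder model (path is just an example).
            InputStream in = new FileInputStream("en-ner-person.bin");
            TokenNameFinderModel model = new TokenNameFinderModel(in);
            in.close();

            NameFinderME nameFinder = new NameFinderME(model);

            // The API works directly on tokenized sentences...
            String[] sentence = {"Pierre", "Vinken", "is", "61", "years", "old", "."};

            // ...and returns plain Span objects carrying token offsets and a type.
            Span[] names = nameFinder.find(sentence);
            for (Span name : names) {
                System.out.println(name.getType() + ": "
                        + name.getStart() + "-" + name.getEnd());
            }
        }
    }

That is about all the API surface an annotation tool needs to deal with on
the model side, which is why I find it attractive for a first prototype.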
> To pre-annotate the articles, we might want to add different types of name
> annotations
>
>> We would like to make a fast binary interface with keyboard shortcuts
>> to focus one sentence at a time. If the user thinks that all the
>> entities in the sentence are correctly annotated by the model, he/she
>> presses "space", the sentence is marked as validated and the focus moves
>> to the next sentence. If the sentence is complete gibberish he/she can
>> discard the sample by pressing "d". The user can also fix individual
>> annotations using the mouse interface before validating the corrected
>> sample.
>
> Did you discuss focusing on the sentence level? This solution would still
> require that one annotator goes through the entire document. Maybe we have
> a user who wants to fix our wikinews model to detect his entity of choice.
> Then he might want to search for sentences which contain it and only label
> these.

Adding a keyword filter / search would be very interesting indeed.

> Working on a sentence level also has the advantage that a user can skip a
> sentence which contains an entity he is not sure about how it should be
> labeled.

Yes.

> Did you think of using GWT? It might be a very good fit for OpenNLP because
> everyone here has a lot of experience with Java, but maybe not so much
> experience with JS.

In my experience the GWT abstraction layer adds more complexity than
anything else when dealing with low-level DOM-related concepts such as
introducing new "span" elements around a mouse selection. I much prefer
debugging in JS with libraries such as jQuery and the Firebug debugger,
even though I am not an experienced JS programmer either. Furthermore,
Hannes already had a working code base.

> Entity disambiguation would be very nice to have in OpenNLP and I also
> need to work on that soon.

I will (soon?) include a couple of new scripts in pignlproc to extract the
occurrence contexts of any kind of entity occurring as wikilinks in
Wikipedia dumps, so as to load those into a Solr index. I will let you know
when that happens.

>> Comments and pull requests on the corpus-refiner prototype are welcome. I
>> plan to keep working on this project from time to time. AFAIK Hannes
>> won't have time to work on the JS layer in the short term, but it
>> should at least be possible to have a first version of the command
>> line based interface rather quickly.
>
> Yes, it would be nice to have such a tool, but for OpenNLP Annotations it
> must be more focused on crowdsourcing and on working well with a small /
> medium-sized group of people.

I agree. The CLI (& Swing) interface is still useful to validate the
workflow concepts though.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
