2011/6/22 Jörn Kottmann <[email protected]>:
> On 6/22/11 10:45 AM, Olivier Grisel wrote:
>>
>> I find the UIMA CAS API much more complicated to work with than
>> directly working with token-level concepts with the OpenNLP API (i.e.
>> with arrays of Span). I haven't had a look at the opennlp-uima
>> subproject though: you probably already have tooling and predefined
>> type systems that make interoperability with CAS instances less of a
>> pain.
>
> If you look at annotation tools, they usually give some flexibility to
> the user in terms of what kind of annotations they are allowed to add.
> One thing I always see is that as soon as they allow more complex
> annotations, the tools and the code which handles the annotations also
> get complex. Have a look at WordFreak or GATE.
>
> The CAS might be difficult to use at first, but at least it works and is
> very well tested. If we create a custom solution we might end up with
> a similar complexity anyway.
>
> We would need to define a type system, but that is something we need
> to do anyway, independent of which way we implement it.
> Maybe we even need to support different type systems for different corpora.
> I guess we start with Wikipedia-based data, but one day we might want to
> annotate an email or blog corpus.
>
> It is an interesting question how the type system should look, since we
> need to track where the annotations come from, and might even want some
> to be double-checked, or need to annotate the disagreement of annotators.

Point taken.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
