Re: UIMA Conventions for certain NLP constructs?

Richard Eckart de Castilho Thu, 18 Feb 2021 06:43:10 -0800

Hi Johann,

the UIMA framework itself does not define how such linguistic concepts
are modelled. What it offers is a framework within which these concepts can be
modelled without prescribing a particular way.


There are various third parties that provide so-called "type systems".
These type systems then specify how certain phenomena are represented.
Usually, these type systems are part of a library of UIMA components
coming from the same third party.

A non-exhaustive list of such third parties is:

- ClearTK
- JCoRe
- cTAKES
- ...
- and DKPro Core (which, by the way, I maintain - so please excuse me if I limit
the examples below to DKPro Core. Going into all type systems would fill a
thesis. If you really want, I can point you to mine, which has a chapter on
type system design...)

> * multi-word tokens and their features: I guess that most UIMA processing
> pipelines will start off with some kind of tokenization where token or word
> annotations (and their offset ranges) are created. But how are multi-word tokens,
> e.g. Spanish "vámonos" = "vamos", "nos" and subsequently properties of the
> words e.g. POS, lemma ("ir", "nosotros") handled? While the multiword token
> itself obviously can be associated with an offset range, the words for that
> token cannot, so how are they annotated?

Difficult one. DKPro Core offers different ways this could be modeled.
For example, we introduced an "order" feature on the token that allows
multiple tokens to share the same position but defines an order in which the
tokens should be processed:

- https://github.com/dkpro/dkpro-core/issues/1152

Related to that is the "form" feature: instead of the actual document text,
processing may need to happen on a normalized form of the token:

- https://github.com/dkpro/dkpro-core/issues/953
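
To make the idea of "order" and "form" a bit more concrete, here is a minimal
sketch in plain Python. It does not use UIMA or DKPro Core at all: the Token
class, its field names and the processing_sequence helper are assumptions made
up for illustration, and the real DKPro Core features may behave differently in
detail. Both words of "vámonos" share the offsets of the multi-word token,
"order" decides which comes first, and "form" supplies the string to process.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    # Hypothetical stand-in for a token annotation; not the DKPro Core type.
    begin: int
    end: int
    order: int = 0               # tie-breaker for tokens sharing one position
    form: Optional[str] = None   # normalized form, if it differs from the text
    lemma: Optional[str] = None


def processing_sequence(text: str, tokens: List[Token]) -> List[str]:
    """Return the strings to process: sorted by begin offset, then by order."""
    ordered = sorted(tokens, key=lambda t: (t.begin, t.order))
    return [t.form if t.form is not None else text[t.begin:t.end]
            for t in ordered]


text = "vámonos ya"
mwt_end = len("vámonos")
ya_begin = text.index("ya")

tokens = [
    Token(ya_begin, ya_begin + 2),
    # Both words cover the same span as the multi-word token "vámonos".
    Token(0, mwt_end, order=2, form="nos", lemma="nosotros"),
    Token(0, mwt_end, order=1, form="vamos", lemma="ir"),
]

print(processing_sequence(text, tokens))
```

The point of the design is that offsets still identify where the material sits
in the document, while a downstream tagger or lemmatizer can iterate over the
words in a well-defined sequence.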

> * how are dependency trees or constituency parses represented? Is there a
> specific data structure just for each of those or for trees or graphs with
> annotations as leaves in general?

The UIMA CAS is essentially an object graph - a tree can be easily modelled.
However, there is no built-in "tree" type. Here is how DKPro Core does it:

https://dkpro.github.io/dkpro-core/releases/2.1.0/docs/typesystem-reference.html#_syntax
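
As a rough illustration of "a tree as an object graph", here is a plain-Python
sketch (no UIMA involved). Each dependency relation is its own object pointing
to a governor token and a dependent token. The class and field names are
assumptions chosen for this example and only loosely follow the DKPro Core
naming; the relation labels are likewise just illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    begin: int
    end: int


@dataclass(frozen=True)
class Dependency:
    # One edge of the tree: the governor is the head, the dependent the child.
    governor: Token
    dependent: Token
    dependency_type: str


def children(token: Token, dependencies: List[Dependency]) -> List[Token]:
    """Tokens directly governed by the given token."""
    return [d.dependent for d in dependencies if d.governor == token]


def roots(tokens: List[Token], dependencies: List[Dependency]) -> List[Token]:
    """Tokens that never appear as a dependent are treated as roots here."""
    dependents = {d.dependent for d in dependencies}
    return [t for t in tokens if t not in dependents]


text = "She reads books"
she, reads, books = Token(0, 3), Token(4, 9), Token(10, 15)
tokens = [she, reads, books]
dependencies = [
    Dependency(governor=reads, dependent=she, dependency_type="nsubj"),
    Dependency(governor=reads, dependent=books, dependency_type="obj"),
]

for root in roots(tokens, dependencies):
    print("root:", text[root.begin:root.end])
    for child in children(root, dependencies):
        print("  child:", text[child.begin:child.end])
```

Constituency trees can be handled in the same spirit, with constituent objects
that hold references to their parent and children.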

>  Similarly, is there a convention for how to represent coreference chains?

Again, here is how DKPro Core does it. The approach is also used by the
annotation tools INCEpTION and WebAnno which I work on.

https://dkpro.github.io/dkpro-core/releases/2.1.0/docs/typesystem-reference.html#_coreference
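
In essence, a chain is modelled as a linked list: a chain object points to its
first link, and every link points to the next one. The following plain-Python
sketch shows that idea only. It is not the DKPro Core API; the class and field
names are simplified assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CoreferenceLink:
    # A mention in the text plus a pointer to the next mention in the chain.
    begin: int
    end: int
    next: Optional["CoreferenceLink"] = None


@dataclass
class CoreferenceChain:
    # The chain itself carries no offsets, only the entry point.
    first: Optional[CoreferenceLink] = None


def mentions(text: str, chain: CoreferenceChain) -> List[str]:
    """Follow the next-pointers and collect the covered text of each link."""
    covered = []
    link = chain.first
    while link is not None:
        covered.append(text[link.begin:link.end])
        link = link.next
    return covered


text = "Anna said she was tired."
she_begin = text.index("she")

she = CoreferenceLink(she_begin, she_begin + len("she"))
anna = CoreferenceLink(0, len("Anna"), next=she)
chain = CoreferenceChain(first=anna)

print(mentions(text, chain))
```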

> * Is there a convention for how to represent cross-document coreferences?

One way of doing it is via a shared identifier - e.g. an identifier from a
knowledge resource such as Wikidata (INCEpTION does that, for example).

Otherwise, you could come up with your own convention such as combining a
URL with some offset information. The W3C Web Annotation standard has a nice
overview of different ways of modelling reference targets.
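
A small plain-Python sketch of the shared-identifier idea: mentions from
different documents end up in the same cluster simply because they carry the
same identifier. The EntityMention class, the document names and the offsets
are made up for illustration; the identifiers are Wikidata entity URIs used
only as examples.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EntityMention:
    document: str     # hypothetical document name or URL
    begin: int
    end: int
    identifier: str   # e.g. an entity URI from a knowledge resource


def group_by_identifier(
        mentions: List[EntityMention]) -> Dict[str, List[EntityMention]]:
    """Cluster mentions across documents by their shared identifier."""
    clusters = defaultdict(list)
    for mention in mentions:
        clusters[mention.identifier].append(mention)
    return dict(clusters)


BERLIN = "http://www.wikidata.org/entity/Q64"
PARIS = "http://www.wikidata.org/entity/Q90"

mentions = [
    EntityMention("doc-a.txt", 0, 6, BERLIN),
    EntityMention("doc-b.txt", 17, 35, BERLIN),
    EntityMention("doc-b.txt", 40, 45, PARIS),
]

clusters = group_by_identifier(mentions)
for identifier, members in clusters.items():
    print(identifier, [(m.document, m.begin, m.end) for m in members])
```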

> * Is there a convention for how to represent parallel documents and map
> between annotations in parallel texts or represent word alignments?

You can use cross-document links.

You can also use the concept of a "view" in UIMA to pair your documents up.
Then you can define a custom feature structure which has pointers to both
views (i.e. both text versions).
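
To sketch the second option without any UIMA code: think of the views as named
text versions and of the custom feature structure as an object holding one
pointer into each of them. Everything below (the Span and AlignmentLink
classes, the view names, the helper functions) is an assumption made up for
illustration and not a UIMA or DKPro Core type.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Span:
    view: str    # name of the view the offsets refer to
    begin: int
    end: int


@dataclass(frozen=True)
class AlignmentLink:
    # Custom structure pointing into both views (i.e. both text versions).
    source: Span
    target: Span


def span_of(views: Dict[str, str], view: str, word: str) -> Span:
    """Locate the first occurrence of a word in a view (demo helper only)."""
    begin = views[view].index(word)
    return Span(view, begin, begin + len(word))


def covered_text(views: Dict[str, str], span: Span) -> str:
    return views[span.view][span.begin:span.end]


def aligned_pairs(views: Dict[str, str],
                  links: List[AlignmentLink]) -> List[Tuple[str, str]]:
    return [(covered_text(views, link.source), covered_text(views, link.target))
            for link in links]


views = {"en": "good morning", "de": "guten Morgen"}
links = [
    AlignmentLink(span_of(views, "en", "good"), span_of(views, "de", "guten")),
    AlignmentLink(span_of(views, "en", "morning"),
                  span_of(views, "de", "Morgen")),
]

print(aligned_pairs(views, links))
```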

> * How are multilingual documents handled, where different parts of the
> document, maybe even just parts of a sentence switch language and thus may
> need to get processed differently?
> Is there a convention for representing such switches in language and for
> how to deal with this?

That is pretty specific to how a particular library of UIMA processing
components is implemented. Some may have a way of specifying a language for
portions of a text. Others might expect the user to break the text up into
segments containing only one language and to process them separately.

> * How does UIMA handle documents from corpora that only contain tokens
> sequences but not any whitespace (e.g. original Conll corpora)?

UIMA itself does not worry about that. The DKPro Core library includes readers
for different kinds of formats including CoNLL-U. The readers try to make a
reasonable choice for whitespace handling depending on the format. E.g. for
most CoNLL formats, we would introduce a space between tokens and a line break
between sentences.

The CoNLL-U format includes metadata as to where spaces should be added, and
the DKPro Core ConllUReader tries to honor this information.
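
Roughly, such a reader has to rebuild a document text from the token sequence
and compute offsets while doing so. Here is a simplified plain-Python sketch of
that idea; it is not the actual DKPro Core reader code. The rebuild_text
function and its input format are assumptions for illustration. Each token
comes with a flag saying whether a space follows it, which is where a CoNLL-U
"SpaceAfter=No" entry would be mapped to False; for formats without such
metadata the flag would simply always be True.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, bool]]   # (token form, space after this token?)


def rebuild_text(sentences: List[Sentence]) -> Tuple[str, List[Tuple[int, int]]]:
    """Rebuild a text and per-token (begin, end) offsets from token sequences.

    Tokens within a sentence are separated by a space unless the flag says
    otherwise; sentences are separated by a line break.
    """
    parts: List[str] = []
    offsets: List[Tuple[int, int]] = []
    position = 0
    for s_index, sentence in enumerate(sentences):
        for t_index, (form, space_after) in enumerate(sentence):
            offsets.append((position, position + len(form)))
            parts.append(form)
            position += len(form)
            is_last_token = t_index == len(sentence) - 1
            if is_last_token:
                # Line break between sentences, nothing after the last one.
                if s_index < len(sentences) - 1:
                    parts.append("\n")
                    position += 1
            elif space_after:
                parts.append(" ")
                position += 1
    return "".join(parts), offsets


sentences = [
    [("Hello", False), (",", True), ("world", False), ("!", True)],
    [("Bye", False), (".", True)],
]

text, offsets = rebuild_text(sentences)
print(repr(text))
print(offsets)
```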

> Any information about this or about how to find out about these things in
> the documentation would be extremely welcome.

I'm afraid you won't find such information on the UIMA website. You'd need to
turn to the websites of the different third-party libraries and to papers the
authors of these libraries may have written.

Cheers,

-- Richard
