Hi all,

I hope this is the right forum to ask these questions; if not, please excuse my
mistake and point me to the right forum or source.

I am currently looking into common conventions for how NLP tools represent
certain common concepts, and I would like to learn whether there are standards,
definitions, or conventions for how this is done by UIMA annotators.
I have to admit that I have never really worked with UIMA and that my knowledge
of how things work in UIMA is limited, so please excuse me if I am not using
the right terminology.

What I am interested in most are the following aspects of representing
NLP-related concepts with stand-off annotations. I would be extremely glad if
somebody could give me a rough explanation of how UIMA does this, or point me
to the places in the documentation where it is best to look to figure these
things out:

* Multi-word tokens and their features: I guess that most UIMA processing
  pipelines will start off with some kind of tokenization where token or word
  annotations (and their offset ranges) are created. But how are multi-word
  tokens handled, e.g. Spanish "vámonos" = "vamos" + "nos", and subsequently
  the properties of those words, e.g. POS and lemma ("ir", "nosotros")? While
  the multi-word token itself can obviously be associated with an offset range,
  the words within that token cannot, so how are they annotated?
* How are dependency trees or constituency parses represented? Is there a
  specific data structure for each of those, or a general one for trees or
  graphs with annotations as leaves?
  Similarly, is there a convention for how to represent coreference chains?
* Is there a convention for how to represent cross-document coreferences?
* Is there a convention for how to represent parallel documents and map
  between annotations in parallel texts, or to represent word alignments?
* How are multilingual documents handled, where different parts of the
  document, maybe even just parts of a sentence, switch language and thus may
  need to be processed differently? Is there a convention for representing
  such language switches and for how to deal with them?
* How does UIMA handle documents from corpora that contain only token
  sequences but no whitespace (e.g. the original CoNLL corpora)?
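To make the first question more concrete, here is a rough sketch in plain Java
of the kind of structure I have in mind. The class names (MultiWordToken, Word)
are made up by me and are not actual UIMA types: the multi-word token carries
the character offsets into the text, while the syntactic words have no offsets
of their own and only point back to the covering token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-off annotation types, purely to illustrate the question.
class MultiWordToken {
    final int begin, end;                        // character offsets into the text
    final List<Word> words = new ArrayList<>();  // syntactic words, in order
    MultiWordToken(int begin, int end) { this.begin = begin; this.end = end; }
}

class Word {
    final String form, lemma, pos;
    final MultiWordToken parent;                 // no offsets of its own
    Word(String form, String lemma, String pos, MultiWordToken parent) {
        this.form = form; this.lemma = lemma; this.pos = pos; this.parent = parent;
        parent.words.add(this);
    }
}

public class MwtSketch {
    public static void main(String[] args) {
        String text = "vámonos";
        // The multi-word token spans the whole surface string ...
        MultiWordToken mwt = new MultiWordToken(0, text.length());
        // ... while the two words only exist as features hanging off it.
        new Word("vamos", "ir", "VERB", mwt);
        new Word("nos", "nosotros", "PRON", mwt);
        System.out.println(mwt.words.size()); // prints 2
    }
}
```

My question is essentially whether UIMA has an established convention for this
pattern, or whether each type system invents its own.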

Any information about this, or about how to find out about these things in the
documentation, would be extremely welcome.

Many thanks and all the best,
  Johann Petrak

---
http://johann-petrak.github.io/
