Hi,

While working on an integration between UIMA and a different text annotation system, we ran into problems with differing offsets between the two systems.

As it turns out, the other system considers CR+LF (Windows-style line endings) to be two characters, while UIMA sees it as one. Clearly, CR+LF is two bytes in one-byte-per-character encodings (ASCII, Latin-1, ...), so all systems based on those encodings will see it as two characters, and as far as I can tell it is also two separate code points (U+000D and U+000A) in Unicode.
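For what it's worth, a plain Java String (which is what I assume backs the CAS document text) also counts CR+LF as two characters / code points:

    String s = "one\r\ntwo";
    System.out.println(s.length());                  // 8: CR and LF each count as a char
    System.out.println("\r\n".codePoints().count()); // 2: two separate Unicode code points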

In a way it makes sense to consider a "newline" as one character, independently of how it is represented, so I think the UIMA way is fine. But is there an overview somewhere of how different systems and programming languages handle this, e.g. when computing offsets or extracting substrings?

Given the mess this can be, it's probably best to normalize all text right at the start, so that we only deal with Unicode strings with LF line endings, encoded as UTF-8 when writing to disk or otherwise serializing the data.
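Something like this minimal Java sketch is what I have in mind (the file names and the helper are just made up for illustration):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class NormalizeText {
        // Convert CR+LF first, then any stray lone CRs, to LF.
        static String normalizeLineEndings(String text) {
            return text.replace("\r\n", "\n").replace("\r", "\n");
        }

        public static void main(String[] args) throws Exception {
            String raw = Files.readString(Path.of("input.txt")); // read as UTF-8
            String normalized = normalizeLineEndings(raw);
            // Serialize as UTF-8 so every system downstream sees the same bytes.
            Files.writeString(Path.of("normalized.txt"), normalized, StandardCharsets.UTF_8);
        }
    }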

It would still be interesting to know how painful this can get without normalizing, e.g. when passing data between UIMA (Java), NLTK (Python), our own C#-based system, etc.

Thanks,
Jens
