Hi,
while working on the integration between UIMA and a different text
annotation system we ran into problems with differing offsets between
the two systems.
As it turns out, the other system considers CR+LF (Windows-style line
endings) to be two characters, while UIMA sees it as one. Clearly,
CR+LF is two bytes in one-byte-per-character encodings (ASCII, Latin-1,
...), so all systems based on those encodings will see it as two
characters, and it is also two Unicode code points (U+000D, U+000A).
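To illustrate with Python (one of the systems mentioned below): a minimal check, using only plain string operations, showing that CR+LF counts as two code points and therefore shifts every offset after it.

```python
# "\r\n" is two Unicode code points: CR (U+000D) and LF (U+000A).
print(len("\r\n"))  # -> 2

s = "line one\r\nline two"
# "line one" is 8 characters, then CR and LF occupy offsets 8 and 9,
# so "line two" starts at offset 10 -- not 9 as it would if the
# newline were counted as a single character.
print(s.index("line two"))  # -> 10
```

A system that treats CR+LF as one character would report offset 9 for the same substring, which is exactly the kind of mismatch described above.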
In a way it makes sense to consider a "newline" as one character,
independently of how it is represented, so I think the UIMA way is fine.
But is there an overview somewhere of how different systems and
programming languages handle this, e.g. when extracting substrings?
Given the mess that this can be, it's probably best to normalize all text
at the beginning so that we only deal with Unicode strings with LF endings,
encoded as UTF-8 when writing to disk or otherwise serializing the data.
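A minimal sketch of that normalization step, again in Python; the function name is ours, not from any of the systems involved. It collapses CR+LF (and lone CR, for old Mac-style input) to LF before any offsets are computed, and encodes as UTF-8 for serialization.

```python
def normalize_newlines(text: str) -> str:
    # Replace CR+LF first, then any remaining lone CR, so that
    # all line endings become a single LF code point.
    return text.replace("\r\n", "\n").replace("\r", "\n")

raw = "one\r\ntwo\rthree\n"
normalized = normalize_newlines(raw)   # "one\ntwo\nthree\n"

# Serialize with UTF-8; offsets computed on `normalized` now agree
# across any system that counts Unicode code points.
data = normalized.encode("utf-8")
```

The key point is that normalization must happen before any annotation offsets are created, since it changes the length of the text.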
It would still be interesting to know how painful this can get when not
normalizing, and e.g. passing data between UIMA (Java), NLTK (Python),
our own C#-based system, etc.
Thanks,
Jens
- CR+LF = 1 character? Jens Grivolla