Hi,

While working on an integration between UIMA and a different text annotation system, we ran into problems with differing offsets between the two systems.

As it turns out, the other system considers CR+LF (Windows-style line endings) to be two characters, while UIMA sees it as one. Clearly, CR+LF is two bytes in one-byte-per-character encodings (ASCII, Latin-1, ...), so all systems based on those encodings will see it as two characters, and as far as I can tell it is also two separate code points (U+000D and U+000A) in Unicode.
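For what it's worth, a plain Java String (which is what I assume backs the CAS document text) also counts CR+LF as two characters / code points:

    String s = "one\r\ntwo";
    System.out.println(s.length());                  // 8: CR and LF each count as a char
    System.out.println("\r\n".codePoints().count()); // 2: two separate Unicode code points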

In a way it makes sense to consider a "newline" as one character, independently of how it is represented, so I think the UIMA way is fine. But is there an overview somewhere of how different systems and programming languages handle this, e.g. when computing offsets or extracting substrings?

Given the mess this can be, it's probably best to normalize all text right at the start, so that we only deal with Unicode strings with LF line endings, encoded as UTF-8 when writing to disk or otherwise serializing the data.
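Something like this minimal Java sketch is what I have in mind (the file names and the helper are just made up for illustration):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class NormalizeText {
        // Convert CR+LF first, then any stray lone CRs, to LF.
        static String normalizeLineEndings(String text) {
            return text.replace("\r\n", "\n").replace("\r", "\n");
        }

        public static void main(String[] args) throws Exception {
            String raw = Files.readString(Path.of("input.txt")); // read as UTF-8
            String normalized = normalizeLineEndings(raw);
            // Serialize as UTF-8 so every system downstream sees the same bytes.
            Files.writeString(Path.of("normalized.txt"), normalized, StandardCharsets.UTF_8);
        }
    }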

It would still be interesting to know how painful this can get without normalizing, e.g. when passing data between UIMA (Java), NLTK (Python), our own C#-based system, etc.

Thanks,
Jens
