On 4/20/11 10:58 AM, Jens Grivolla wrote:
Hi,

while working on the integration between UIMA and a different text annotation system we ran into problems with differing offsets between the two systems.

As it turns out, the other system considers CR+LF (Windows style line endings) to be two characters, while UIMA sees it as one.

The string sofa inside a CAS holds 16-bit Unicode (UTF-16) code units, and CR+LF is two of them. So I believe you are mistaken, or there is a bug somewhere that turns CR+LF into one char. All offsets are counted in 16-bit code units, even though a single character might need two 16-bit slots. So it is possible to have an annotation over one character that has a length of two.
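
A quick way to check this outside of UIMA is with plain Java strings, which use the same 16-bit units. This is only a minimal sketch: the class and method names (OffsetDemo, utf16Length, codePointLength) are made up for illustration and are not part of the UIMA API.

```java
public class OffsetDemo {
    // Length in 16-bit UTF-16 code units; this is what String.length() reports.
    static int utf16Length(String s) {
        return s.length();
    }

    // Length in Unicode code points; a surrogate pair counts as one.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // CR and LF are two separate code units, so 'b' sits at offset 3.
        String crlf = "a\r\nb";
        System.out.println(utf16Length(crlf));   // 4
        System.out.println(crlf.indexOf('b'));   // 3

        // U+1D11E (musical G clef) lies outside the BMP and needs a surrogate pair.
        String clef = "\uD834\uDD1E";
        System.out.println(utf16Length(clef));     // 2
        System.out.println(codePointLength(clef)); // 1
    }
}
```

If the other system reports the same numbers for the same text, the difference is probably introduced while the document is read or converted, before it is set as the sofa string.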

Jörn
