On 4/20/11 10:58 AM, Jens Grivolla wrote:
> Hi,
> while working on the integration between UIMA and a different text
> annotation system we ran into problems with differing offsets between
> the two systems.
> As it turns out, the other system considers CR+LF (Windows style line
> endings) to be two characters, while UIMA sees it as one.

The string sofa inside a CAS contains 16-bit Unicode (UTF-16) code units,
and CR+LF is two of them. So I believe you are mistaken, or there is a
bug somewhere that turns CR+LF into one char. All offsets are 16-bit
code unit offsets, even though a single character might need two 16-bit
slots (a surrogate pair). So it is possible to have an annotation over
one character which has a length of two.
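
To illustrate, here is a minimal sketch in plain Java. It does not use any
UIMA API; it only assumes that the sofa text behaves like an ordinary Java
String, which is indexed in 16-bit code units. The class and method names
(Utf16Offsets, codeUnits, codePoints) are made up for this example.

```java
public class Utf16Offsets {

    // Length in UTF-16 code units, the unit Java Strings are indexed in.
    static int codeUnits(String s) {
        return s.length();
    }

    // Length in Unicode code points (a surrogate pair counts once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Windows line ending: two separate characters, CR and LF.
        String crlf = "\r\n";

        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic
        // Multilingual Plane, so it needs a surrogate pair in UTF-16.
        String gClef = new String(Character.toChars(0x1D11E));

        System.out.println("CR+LF code units:   " + codeUnits(crlf));
        System.out.println("CR+LF code points:  " + codePoints(crlf));
        System.out.println("G clef code units:  " + codeUnits(gClef));
        System.out.println("G clef code points: " + codePoints(gClef));
    }
}
```

CR+LF counts as two in both measures, while the G clef is one code point
but two code units. If the other system counts code points or bytes
instead, offsets will drift apart after the first such character.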
Jörn