Thanks Remy, that makes sense, but I'm wondering why I get the correct offsets 
in one way of accessing ctakes (the CVD) but the wrong offsets through another 
way (the REST interface)?

I guess for the fake notes I'm fully in favor of saving as plain text/ascii 
files to simplify things. But there are more unicode characters than we can 
write smart rules for and I'd like to make sure unicode strings at least don't 
screw up offsets, even if we don't process them meaningfully. I'm sure we all 
look forward to Generation Z doctors' notes that use the thumbs up/down emojis 
for patient prognosis :).

Tim



-----Original Message-----
From: Remy Sanouillet <re...@foreseemed.com>
Reply-to: <dev@ctakes.apache.org>
To: dev@ctakes.apache.org
Subject: Re: unicode issues [EXTERNAL]
Date: Thu, 18 Jul 2019 13:37:33 -0700

Hi Tim,

What is happening is that your o'clock contains a smart quote (Unicode U+2019), 
which is encoded in UTF-8 as three bytes: 0xE28099, so you have to take those 
two extra bytes into account when counting offsets. For that particular 
character, it is much easier to just preprocess the text and replace all 
occurrences with the simple apostrophe (ASCII 0x27), the one on your keyboard. 
It won't change any interpretation and it makes life simpler for everyone 
downstream. You will probably want to deal with all other extended Unicode 
characters, such as emojis, in a similar way; otherwise you will encounter the 
same offset issues.
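
A minimal Python sketch of the byte counts and the suggested preprocessing 
step. The helper name normalize_quotes is hypothetical, and the choice to map 
both the left and right single smart quotes to the ASCII apostrophe is an 
assumption for illustration, not something cTAKES does itself:

```python
# Sketch only: normalize_quotes is a hypothetical helper, not part of cTAKES.
SMART_RIGHT = "\u2019"  # RIGHT SINGLE QUOTATION MARK
SMART_LEFT = "\u2018"   # LEFT SINGLE QUOTATION MARK (assumed to be handled the same way)

utf8_bytes = SMART_RIGHT.encode("utf-8")
print(utf8_bytes.hex())                   # e28099
print(len(SMART_RIGHT), len(utf8_bytes))  # 1 3  (one character, three bytes)
print(hex(ord("'")))                      # 0x27 (the plain ASCII apostrophe)


def normalize_quotes(text: str) -> str:
    """Replace single smart quotes with the ASCII apostrophe."""
    return text.replace(SMART_RIGHT, "'").replace(SMART_LEFT, "'")


cleaned = normalize_quotes("1 o\u2019clock position")
print(cleaned)                                      # 1 o'clock position
print(len(cleaned) == len(cleaned.encode("utf-8")))  # True: char and byte offsets now agree
```

Because the replacement is one character for one character, character offsets 
in the cleaned text still line up with the original text.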

Rémy Sanouillet
NLP Engineer
re...@foreseemed.com


ForeSee Medical, Inc.
12555 High Bluff Drive, Suite 100
San Diego, CA 92130



On Thu, Jul 18, 2019 at 1:20 PM Miller, Timothy 
<timothy.mil...@childrens.harvard.edu> wrote:
I'm having a weird issue with unicode characters in one of the sample notes 
distributed with ctakes. The sentence is:

The right breast and axilla were sterilely prepped and draped in the usual 
standard fashion.  First the right 1 o’clock position 5 cm from the nipple was 
targeted.  Local anesthesia was obtained with 2% xylocaine.  A small skin 
incision was made.  Under ultrasound guidance from a medial approach, 2 passes 
with a 14 gauge biopsy device were performed and sent to pathology.  A clip was 
placed.

The Unicode character is the right single quote in "o’clock". If I just put 
it in the CVD everything works fine, e.g. I find the drug "xylocaine" at 
location 203-212 and it's highlighted correctly. However, if I use the REST 
interface and send it using the python requests API, I get back the span 
205:214. If we then grab that span we get the wrong string (offset by 2, so 
something like "locaine. ").

Any thoughts on where things might be going wrong for the REST interface? Does 
anyone more knowledgeable than me know how UIMA and cTAKES (and java for that 
matter) normally handle unicode?
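
One way to probe this is to compare character offsets with UTF-8 byte offsets 
for the same sentence. The sketch below is an illustration under an 
assumption: that the shifted span counts UTF-8 bytes rather than characters. 
It does not confirm what the REST service does internally, and the helper 
names are hypothetical:

```python
# Sketch only: helper names are hypothetical; the "REST offsets are UTF-8 byte
# offsets" idea is an assumption that happens to match the observed shift of 2.
TEXT = (
    "The right breast and axilla were sterilely prepped and draped in the usual "
    "standard fashion.  First the right 1 o\u2019clock position 5 cm from the "
    "nipple was targeted.  Local anesthesia was obtained with 2% xylocaine.  "
    "A small skin incision was made."
)


def char_to_byte_offset(text: str, char_offset: int) -> int:
    """Number of UTF-8 bytes that precede the given character offset."""
    return len(text[:char_offset].encode("utf-8"))


def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Number of characters that precede the given UTF-8 byte offset."""
    prefix = text.encode("utf-8")[:byte_offset]
    # errors="ignore" drops a partial character if the cut lands mid-sequence.
    return len(prefix.decode("utf-8", errors="ignore"))


char_start = TEXT.index("xylocaine")
byte_start = char_to_byte_offset(TEXT, char_start)
print(char_start, byte_start)  # 203 205

# Map the shifted span back to character offsets before slicing.
fixed = (byte_to_char_offset(TEXT, 205), byte_to_char_offset(TEXT, 214))
print(fixed, TEXT[fixed[0]:fixed[1]])  # (203, 212) xylocaine
```

If the numbers from the service match the byte offsets here, converting spans 
with something like byte_to_char_offset before slicing would be one workaround.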

Tim

