Re: unicode issues

Remy Sanouillet Thu, 18 Jul 2019 13:38:10 -0700

Hi Tim,

What is happening is that your "o’clock" contains a smart quote (Unicode
U+2019), which UTF-8 encodes as three bytes: 0xE2 0x80 0x99. You therefore
have to take those two extra bytes into account when counting offsets. For
that particular character, it is much easier to preprocess the text and
replace all occurrences with the simple apostrophe (ASCII 0x27), the one on
your keyboard. It won't change any interpretation and it makes life simpler
for everyone downstream. You will probably want to handle all other
non-ASCII characters, such as emojis, in the same way; otherwise you will
encounter the same offset issues.
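
To make the arithmetic concrete, here is a minimal Python sketch. It assumes
the shifted offsets really are counted in UTF-8 bytes rather than characters,
which I have not verified against the REST service. The helper names
(normalize_quotes, byte_to_char_offset) and the shortened sample string are
made up for illustration and are not part of cTAKES.

```python
# Sketch only: helper names and the sample text are hypothetical.
SMART_QUOTE = "\u2019"  # RIGHT SINGLE QUOTATION MARK


def normalize_quotes(text: str) -> str:
    """Replace the smart quote with the plain ASCII apostrophe (0x27)."""
    return text.replace(SMART_QUOTE, "'")


def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Convert an offset counted in UTF-8 bytes to one counted in characters.

    Assumes the offset was measured against the UTF-8 encoding of `text`.
    A partial character at the cut point is dropped by errors="ignore".
    """
    prefix = text.encode("utf-8")[:byte_offset]
    return len(prefix.decode("utf-8", errors="ignore"))


sample = ("First the right 1 o\u2019clock position was targeted. "
          "Local anesthesia was obtained with 2% xylocaine.")

# One character, three bytes: this is where the two extra units come from.
quote_bytes = SMART_QUOTE.encode("utf-8")

char_start = sample.index("xylocaine")                   # offset in characters
byte_start = sample.encode("utf-8").index(b"xylocaine")  # offset in bytes

# Option 1: translate the byte offset back to a character offset.
recovered = byte_to_char_offset(sample, byte_start)

# Option 2: normalize before sending, so both ways of counting agree.
clean = normalize_quotes(sample)
clean_char_start = clean.index("xylocaine")
clean_byte_start = clean.encode("utf-8").index(b"xylocaine")
```

With the smart quote in place, byte_start comes out 2 larger than char_start,
the same shift you see between 203 and 205. After normalize_quotes the two
counts agree, which is why preprocessing is the simpler route.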

*Rémy Sanouillet*
NLP Engineer
[email protected] <[email protected]>

ForeSee Medical, Inc.
12555 High Bluff Drive, Suite 100
San Diego, CA 92130

On Thu, Jul 18, 2019 at 1:20 PM Miller, Timothy <
[email protected]> wrote:

> I'm having a weird issue with unicode characters in one of the sample
> notes distributed with ctakes. The sentence is:
>
> The right breast and axilla were sterilely prepped and draped in the usual
> standard fashion.  First the right 1 o’clock position 5 cm from the nipple
> was targeted.  Local anesthesia was obtained with 2% xylocaine.  A small
> skin incision was made.  Under ultrasound guidance from a medial approach,
> 2 passes with a 14 gauge biopsy device were performed and sent to
> pathology.  A clip was placed.
>
> The unicode characters are the right single quotes in "o'clock". If I just
> put it in the CVD everything works fine, e.g. I find the drug "xylocaine"
> at location 203-212 and it's highlighted correctly. However, if I use the
> REST interface and send it using the python requests API, I get back the
> span 205:214. If we then grab that span we get the wrong string (offset by
> 2, so something like "locaine. ").
>
> Any thoughts on where things might be going wrong for the REST interface?
> Does anyone more knowledgeable than me know how UIMA and cTAKES (and java
> for that matter) normally handle unicode?
>
> Tim
>
>
