Hi Sean,

Thanks for the information. I was having a similar issue related to "spans" 
occasionally being off by one when running cTAKES 4.0.0 in two different modes 
- a modified entry point for a Spark cluster and validation of a random subset 
using runClinicalPipeline.sh.

I was looking through the FileTreeReader class and noticed something that I 
think may have contributed to the discrepancies. The following line 
(https://github.com/apache/ctakes/blob/7f6dfd7d20253f88c25bea2fdde5cf22b004b63d/ctakes-core/src/main/java/org/apache/ctakes/core/cr/FileTreeReader.java#L243)
allocates an 8192-byte buffer, so each read pulls in at most 8192 bytes of the 
file, and each byte array is converted into a string before the next read.

What I noticed for some of our documents is that the end of the byte array 
falls in the middle of a multi-byte character. As a result, the method converts 
the first part of the character's bytes to a string on one loop iteration and 
the remaining bytes on the next. Each partial sequence decodes to its own 
(replacement) character, so the document text gains an extra character, which I 
think is ultimately what causes our "span" discrepancy.
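
To make the failure mode concrete, here is a small standalone sketch. It is not 
the actual FileTreeReader code, just the same chunk-by-chunk decoding pattern, 
and it uses a tiny 4-byte chunk so the boundary lands inside a two-byte 
character; the same thing can happen at the 8192-byte mark with a real document:

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class SplitCharDemo {
    public static void main(String[] args) throws Exception {
        // "abc" + e-acute (U+00E9), which is two bytes in UTF-8 (0xC3 0xA9),
        // so the 5-byte input splits the character across two 4-byte chunks.
        String original = "abc\u00e9";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        int chunkSize = 4;

        // Pattern A: decode each chunk on its own (the pattern I believe is
        // causing the problem). Each half of the split character decodes to
        // its own U+FFFD replacement character, adding a character.
        StringBuilder chunked = new StringBuilder();
        InputStream in = new ByteArrayInputStream(utf8);
        byte[] buffer = new byte[chunkSize];
        int read;
        while ((read = in.read(buffer)) > 0) {
            chunked.append(new String(buffer, 0, read, StandardCharsets.UTF_8));
        }

        // Pattern B: let a Reader carry partial byte sequences across reads.
        StringBuilder streamed = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            char[] cbuf = new char[chunkSize];
            int n;
            while ((n = reader.read(cbuf)) > 0) {
                streamed.append(cbuf, 0, n);
            }
        }

        System.out.println(original.length()); // 4
        System.out.println(streamed.length()); // 4
        System.out.println(chunked.length());  // 5 -- one character too many
    }
}

Decoding through an InputStreamReader (or collecting all of the bytes before 
decoding once) keeps the character count the same no matter where the read 
boundary falls.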

Does my thought process make sense with your understanding of the code?

Thanks,
Jeritt

On 2019/07/18 21:22:34, "Finan, Sean" <[email protected]> wrote: 
> Hi Tim, Remy,
> 
> The fake notes have non-UTF-8 formatting in the smoker/ directory.  You can 
> run the default pipeline on those files and look at various outputs (Pretty 
> Text, Pretty Property, Pretty Html) and you will see that ctakes maintains 
> offsets despite the encoding.
> 
> The FileTreeReader used by the Default Clinical Pipeline has the ability to 
> read and maintain different encodings as set by the optional parameter 
> "Encoding".  When not specified the encoding goes with the java default, 
> normally UTF-8.
> 
> The FileTreeReader actually reads a byte stream, not encoded characters.  By 
> default the -extra- bytes will be put in the document text and ctakes thinks 
> that they are odd (non-alpha ASCII) characters.   Therefore the text offsets 
> will not be messed up.  Individual engines may or may not be impacted by the 
> non-alpha characters.  For instance, I have noticed that cleartk annotators 
> slow down when presented with these documents - e.g. smoker/doc2_*past_smoker 
> has 137 words on 32 lines, but assertion takes 2 full seconds.
> 
> I think that the problem arises because the rest interface accepts a posted 
> string (any format / unicode) and no byte -to- UTF-8 conversion is performed.  Each 
> annotator in the pipeline is left up to its own devices with respect to 
> handling or not handling special characters.
> 
> We can try to perform a similar conversion (string -to- raw byte, byte to 
> string) in the CtakesRestController.
> 
> Sean
> 
> 
> 
> ________________________________
> From: Remy Sanouillet <[email protected]>
> Sent: Thursday, July 18, 2019 5:06 PM
> To: [email protected]
> Subject: Re: unicode issues [EXTERNAL]
> 
> From my experience, cTakes is fully capable of dealing with Unicode input 
> since even the default dictionary contains some diacritics and those entries 
> are recognized. My guess is that something is getting lost in translation in 
> the encoding/decoding occurring around the REST API. You have to be very 
> careful with python to specify the correct encoding when doing any Unicode 
> text transfer.
> 
> Rémy Sanouillet
> NLP Engineer
> [email protected]<mailto:[email protected]>
> 
> 
> ForeSee Medical, Inc.
> 12555 High Bluff Drive, Suite 100
> San Diego, CA 92130
> 
> 
> 
> On Thu, Jul 18, 2019 at 1:47 PM Miller, Timothy 
> <[email protected]<mailto:[email protected]>>
>  wrote:
> Thanks Remy, that makes sense, but I'm wondering why I get the correct 
> offsets in one way of accessing ctakes (the CVD) but the wrong offsets 
> through another way (the REST interface)?
> 
> I guess for the fake notes I'm fully in favor of saving as plain text/ascii 
> files to simplify things. But there are more unicode characters than we can 
> write smart rules for and I'd like to make sure unicode strings at least 
> don't screw up offsets, even if we don't process them meaningfully. I'm sure 
> we all look forward to Generation Z doctors' notes that use the thumbs 
> up/down emojis for patient prognosis :).
> 
> Tim
> 
> 
> 
> -----Original Message-----
> From: Remy Sanouillet <[email protected]>
> Reply-to: <[email protected]>
> To: [email protected]
> Subject: Re: unicode issues [EXTERNAL]
> Date: Thu, 18 Jul 2019 13:37:33 -0700
> 
> Hi Tim,
> 
> What is happening is that your o'clock contains a smart quote (Unicode 
> U+2019) which is encoded in UTF-8 as three bytes (0xE2 0x80 0x99), so you 
> have to take those two extra bytes into account when counting offsets. For 
> that particular character, it is much easier to just preprocess the text and 
> replace all occurrences with the simple apostrophe (ASCII 0x27), the one on 
> your keyboard. It won't change any interpretation and it makes life simpler 
> for everyone downstream. You will probably want to deal with all extended 
> Unicode characters like emojis the same way; otherwise you will encounter the 
> same offset issues.
> 
> Rémy Sanouillet
> NLP Engineer
> [email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>>
> 
> 
> ForeSee Medical, Inc.
> 12555 High Bluff Drive, Suite 100
> San Diego, CA 92130
> 
> 
> 
> On Thu, Jul 18, 2019 at 1:20 PM Miller, Timothy 
> <[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>>>
>  wrote:
> I'm having a weird issue with unicode characters in one of the sample notes 
> distributed with ctakes. The sentence is:
> 
> The right breast and axilla were sterilely prepped and draped in the usual 
> standard fashion.  First the right 1 o'clock position 5 cm from the nipple 
> was targeted.  Local anesthesia was obtained with 2% xylocaine.  A small skin 
> incision was made.  Under ultrasound guidance from a medial approach, 2 
> passes with a 14 gauge biopsy device were performed and sent to pathology.  A 
> clip was placed.
> 
> The unicode character is the right single quote in "o'clock". If I just 
> put it in the CVD everything works fine, e.g. I find the drug "xylocaine" at 
> location 203-212 and it's highlighted correctly. However, if I use the REST 
> interface and send it using the python requests API, I get back the span 
> 205:214. If we then grab that span, we get the wrong string (offset by 2), 
> something like "locaine. ".
> 
> Any thoughts on where things might be going wrong for the REST interface? 
> Does anyone more knowledgeable than me know how UIMA and cTAKES (and java for 
> that matter) normally handle unicode?
> 
> Tim
> 
> 
> 
