subject:"unicode issues"

Re: unicode issues [EXTERNAL]

2019-09-22 Thread Finan, Sean

Hi Jeritt,

I checked in a change to FileTreeReader.  There is good and bad:  The bad is 
that it will ignore any encoding explicitly set by the user.  The good is that 
it will bypass the buffer-to-String step, so as long as Java figures out the 
encoding there should be no problems with buffers cutting characters in half.
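
A minimal sketch of the general idea, decoding through a Reader instead of converting each byte buffer to a String. This is not the committed FileTreeReader change; the class name, method name and buffer size are hypothetical, and the charset is passed in only to keep the example deterministic (Charset.defaultCharset() could be passed to follow the Java default).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not the actual FileTreeReader code.
class ReaderDecodeSketch {

    // Decodes the whole stream through a Reader. The decoder inside
    // InputStreamReader carries incomplete byte sequences over to the next
    // read, so a buffer boundary cannot cut a multi-byte character in half.
    static String readAll(InputStream in, Charset charset, int bufferSize) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[bufferSize];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "1 o\u2019clock position";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // A deliberately tiny buffer forces many reads.
        String decoded = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8, 4);
        System.out.println(decoded.equals(text)); // prints true
    }
}
```

The buffer here holds decoded chars rather than raw bytes, which is why its size no longer matters for correctness.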

My tests have worked on three different encodings, but if anybody out there has 
problems then please let me know.

Thanks again for making me aware of a problem.

Sean


Re: unicode issues [EXTERNAL]

2019-09-17 Thread Finan, Sean

Hi Jeritt,

That makes perfect sense.  I will ruminate on possible solutions.  Somebody 
must have dealt with this elsewhere.

Thanks,
Sean


Re: unicode issues [EXTERNAL]

2019-09-17 Thread Jeritt Thayer

Hi Sean,

Thanks for the information. I was having a similar issue related to "spans" 
occasionally being off by one when running cTAKES 4.0.0 in two different modes 
- a modified entry point for a Spark cluster and validation of a random subset 
using runClinicalPipeline.sh.

I was looking through the FileTreeReader class and noticed something that I 
think may have contributed to the discrepancies. The following line 
(https://github.com/apache/ctakes/blob/7f6dfd7d20253f88c25bea2fdde5cf22b004b63d/ctakes-core/src/main/java/org/apache/ctakes/core/cr/FileTreeReader.java#L243)
 sets a buffer to 8192, which will read in the first 8192 bytes. At that point, 
this first byte array is converted into a string.

What I noticed for some of our documents is that the last position in the byte 
array falls in the middle of a multi-byte character. As a result, the method 
converts the first part of the character’s bytes to a string on the first loop 
iteration and the remaining bytes on the second. This results in an additional 
character, which I think is ultimately causing our "span" discrepancy.
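
The effect described above can be reproduced in isolation. The sketch below is not the FileTreeReader code; the class and method names are made up, and a chunk size of 2 stands in for 8192 so that the boundary is guaranteed to land inside a multi-byte character.

```java
import java.nio.charset.StandardCharsets;

// Illustration only; names and chunk size are hypothetical.
class ChunkSplitDemo {

    // Mimics converting each filled byte buffer to a String on its own.
    static String decodeChunkByChunk(byte[] bytes, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int start = 0; start < bytes.length; start += chunkSize) {
            int length = Math.min(chunkSize, bytes.length - start);
            // Each chunk is decoded independently, so a character whose bytes
            // straddle two chunks is seen as two malformed fragments.
            sb.append(new String(bytes, start, length, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "o\u2019clock"; // U+2019 takes three bytes in UTF-8
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String whole = new String(utf8, StandardCharsets.UTF_8);
        String chunked = decodeChunkByChunk(utf8, 2); // boundary falls inside U+2019
        System.out.println("whole length:   " + whole.length());
        System.out.println("chunked length: " + chunked.length());
        System.out.println("identical:      " + whole.equals(chunked));
    }
}
```

The chunked result comes out longer because the decoder substitutes replacement characters for the fragments, and every extra character shifts all later offsets.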

Does my thought process make sense with your understanding of the code?

Thanks,
Jeritt


Re: unicode issues [EXTERNAL]

2019-07-18 Thread Finan, Sean

Hi Tim, Remy,

The fake notes in the smoker/ directory have non-UTF-8 formatting.  You can 
run the default pipeline on those files and look at the various outputs (Pretty 
Text, Pretty Property, Pretty Html) and you will see that cTAKES maintains 
offsets despite the encoding.

The FileTreeReader used by the Default Clinical Pipeline can read and maintain 
different encodings, as set by the optional parameter "Encoding".  When it is 
not specified, the encoding falls back to the Java default, normally UTF-8.

The FileTreeReader actually reads a byte stream, not encoded characters.  By 
default the extra bytes are put in the document text and cTAKES treats them 
as odd (non-alpha ASCII) characters.  Therefore the text offsets 
will not be messed up.  Individual engines may or may not be affected by the 
non-alpha characters.  For instance, I have noticed that ClearTK annotators 
slow down when presented with these documents, e.g. smoker/doc2_*past_smoker 
has 137 words on 32 lines, but assertion takes 2 full seconds.
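
One way to picture how a byte-level view and a character-level view of the same text can disagree by two positions is sketched below. It is an illustration only, not a statement of what FileTreeReader or the REST controller actually does; the class name, the shortened sentence and the use of ISO-8859-1 as a "one char per byte" decoding are all assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only; names and the choice of ISO-8859-1 are assumptions.
class ByteVersusCharOffsets {

    // Offset of a term after decoding the bytes with the given charset.
    static int offsetOf(byte[] bytes, Charset charset, String term) {
        return new String(bytes, charset).indexOf(term);
    }

    public static void main(String[] args) {
        String sentence = "First the right 1 o\u2019clock position. "
                + "Local anesthesia was obtained with 2% xylocaine.";
        byte[] utf8 = sentence.getBytes(StandardCharsets.UTF_8);

        // Decoded as UTF-8, the smart quote is a single char.
        int asUtf8 = offsetOf(utf8, StandardCharsets.UTF_8, "xylocaine");
        // Decoded one char per byte, the same quote occupies three chars.
        int asBytes = offsetOf(utf8, StandardCharsets.ISO_8859_1, "xylocaine");

        System.out.println(asBytes - asUtf8); // prints 2
    }
}
```

Either view is self-consistent; the spans only look wrong when offsets computed in one view are applied to text held in the other.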

I think that the problem arises because the REST interface accepts a posted 
string (any format / Unicode) and no byte-to-UTF-8 conversion is performed.  Each 
annotator in the pipeline is left to its own devices with respect to 
handling or not handling special characters.

We can try to perform a similar conversion (string to raw bytes, bytes to 
string) in the CtakesRestController.

Sean


Re: unicode issues [EXTERNAL]

2019-07-18 Thread Remy Sanouillet

From my experience, cTAKES is fully capable of dealing with Unicode input
since even the default dictionary contains some diacritics and those
entries are recognized. My guess is that something is getting lost in
translation in the encoding/decoding occurring around the REST API. You have
to be very careful with Python to specify the correct encoding when doing
any Unicode text transfer.

*Rémy Sanouillet*
NLP Engineer
re...@foreseemed.com 


ForeSee Medical, Inc.
12555 High Bluff Drive, Suite 100
San Diego, CA 92130

NOTICE: This e-mail message and all attachments transmitted with it are
intended solely for the use of the addressee and may contain legally
privileged and confidential information. If the reader of this message is
not the intended recipient, or an employee or agent responsible for
delivering this message to the intended recipient, you are hereby notified
that any dissemination, distribution, copying, or other use of this message
or its attachments is strictly prohibited. If you have received this
message in error, please notify the sender immediately by replying to this
message and please delete it from your computer.



Re: unicode issues [EXTERNAL]

2019-07-18 Thread Miller, Timothy

Thanks Remy, that makes sense, but I'm wondering why I get the correct offsets 
in one way of accessing ctakes (the CVD) but the wrong offsets through another 
way (the REST interface)?

I guess for the fake notes I'm fully in favor of saving as plain text/ASCII 
files to simplify things. But there are more Unicode characters than we can 
write smart rules for, and I'd like to make sure Unicode strings at least don't 
screw up offsets, even if we don't process them meaningfully. I'm sure we all 
look forward to Generation Z doctors' notes that use the thumbs up/down emojis 
for patient prognosis :).

Tim




Re: unicode issues

2019-07-18 Thread Remy Sanouillet

Hi Tim,

What is happening is that your o'clock contains a smart quote (Unicode
U+2019) which is encoded in UTF-8 as three bytes: 0xE28099, so you have to take
those two extra bytes into account when counting offsets. For that
particular character, it is much easier to just preprocess the text and
replace all occurrences with the simple apostrophe (ASCII 0x27), the one on
your keyboard. It won't change any interpretation and it makes life simpler
for everyone downstream. You will probably want to deal with all extended
Unicode characters, like emojis; otherwise you will encounter the same
offset issues.
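
The byte values and the suggested preprocessing step can be checked with a few lines of Java. This is only a sketch; the class and method names are hypothetical, and it handles just the one character discussed here.

```java
import java.nio.charset.StandardCharsets;

// Illustration only; names are hypothetical.
class SmartQuoteDemo {

    // Hex dump of the UTF-8 encoding of a string.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Replaces the right single quotation mark (U+2019) with the ASCII apostrophe.
    static String normalizeQuotes(String s) {
        return s.replace('\u2019', '\'');
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u2019"));               // prints E28099
        System.out.println(utf8Hex("'"));                    // prints 27
        System.out.println(normalizeQuotes("o\u2019clock")); // prints o'clock
    }
}
```

Because the replacement swaps one char for one char, character offsets in the Java string stay the same; only the byte length shrinks.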

*Rémy Sanouillet*
NLP Engineer
re...@foreseemed.com 

ForeSee Medical, Inc.
12555 High Bluff Drive, Suite 100
San Diego, CA 92130


unicode issues

2019-07-18 Thread Miller, Timothy

I'm having a weird issue with unicode characters in one of the sample notes 
distributed with ctakes. The sentence is:

The right breast and axilla were sterilely prepped and draped in the usual 
standard fashion.  First the right 1 o’clock position 5 cm from the nipple was 
targeted.  Local anesthesia was obtained with 2% xylocaine.  A small skin 
incision was made.  Under ultrasound guidance from a medial approach, 2 passes 
with a 14 gauge biopsy device were performed and sent to pathology.  A clip was 
placed.

The Unicode characters are the right single quotes in "o'clock". If I just put 
it in the CVD everything works fine, e.g. I find the drug "xylocaine" at 
location 203-212 and it's highlighted correctly. However, if I use the REST 
interface and send it using the Python requests API, I get back the span 
205:214. If we then grab that span we get the wrong string (offset by 2, so 
something like "locaine. ").

Any thoughts on where things might be going wrong for the REST interface? Does 
anyone more knowledgeable than me know how UIMA and cTAKES (and Java for that 
matter) normally handle Unicode?
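
On the Java side of that question, a String is a sequence of UTF-16 code units, and String.length() and indexOf() count those units rather than bytes or code points. The sketch below compares the three ways of counting; the class and method names are hypothetical, and how UIMA or cTAKES count offsets is not something this example demonstrates.

```java
import java.nio.charset.StandardCharsets;

// Illustration only; names are hypothetical.
class JavaStringUnits {

    static int utf16Units(String s) {
        return s.length();
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String quote = "o\u2019clock";  // contains the right single quotation mark
        String thumbs = "\uD83D\uDC4D"; // thumbs-up emoji, U+1F44D, as a surrogate pair

        // prints 7 7 9
        System.out.println(utf16Units(quote) + " " + codePoints(quote) + " " + utf8Bytes(quote));
        // prints 2 1 4
        System.out.println(utf16Units(thumbs) + " " + codePoints(thumbs) + " " + utf8Bytes(thumbs));
    }
}
```

A client that counts bytes or code points (Python 3 strings count code points) can therefore disagree with Java offsets as soon as the text leaves plain ASCII.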

Tim
