Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

2019-07-17 Thread Finan, Sean
Hi All,

ctakes-scrubber is not in any ctakes release and it is not in the main 
repository.  It never went beyond experimental and resides within the ctakes 
sandbox.  https://svn.apache.org/repos/asf/ctakes/sandbox/

From what I recall, scrubber does not have "real" name replacement, but 
instead de-identifies entities by removing them and inserting a tag indicating 
the type of entity.  For instance:  "John has a rash" -> "[person] has a 
rash".   That is not verbatim, but it is the general idea.

If you can get ctakes-scrubber working in your project then it would be pretty 
easy to create an engine that does nothing except replace such generic tags 
with random names, dates, institutions, etc.
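
As an illustration of such a tag-replacing step, here is a minimal Python 
sketch. It is not code from ctakes-scrubber: the "[type]" tag format follows 
the non-verbatim example above, and the surrogate lists and the name 
fill_tags are assumptions made up for this example.

```python
import random
import re

# Hypothetical surrogate pools; a real system would load larger dictionaries.
SURROGATES = {
    "person": ["Sally Jones", "Robert Quincy", "Maria Lopez"],
    "date": ["2001-03-14", "1999-11-02"],
    "institution": ["Riverside Clinic", "Lakeview Hospital"],
}

# Assumed tag format: a type name in square brackets, e.g. "[person]".
TAG_PATTERN = re.compile(r"\[(\w+)\]")


def fill_tags(text, rng=None):
    """Replace each [type] tag with a random surrogate of that type."""
    if rng is None:
        rng = random.Random()

    def substitute(match):
        pool = SURROGATES.get(match.group(1).lower())
        # Leave unknown tags untouched rather than guessing a replacement.
        return rng.choice(pool) if pool else match.group(0)

    return TAG_PATTERN.sub(substitute, text)


print(fill_tags("[person] has a rash", random.Random(0)))
```

Because each tag is filled independently, two mentions of the same person can 
receive different surrogates in this sketch.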

Sean

From: gandhi rajan 
Sent: Wednesday, July 17, 2019 12:26 PM
To: dev@ctakes.apache.org
Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

Hi Masoud, we had a similar requirement to identify patient names in
narrative text, and I discussed the patient name identification feature in
cTAKES with Sean Finan. What he told me at that point was that cTAKES did
not support patient name identification. Also, I'm not really sure whether
scrubber made it into the cTAKES codebase.

Sean, please correct me if I'm wrong.

On Wednesday, July 17, 2019, Masoud Rouhizadeh  wrote:

> Dear cTAKES developer,
> This is Masoud Rouhizadeh from JHU. I'm leading the NLP effort at the
> Institute for Clinical and Translational Research and work on
> enterprise-level NLP projects at Johns Hopkins Medicine. One of the major
> goals we are targeting is de-identification of a large number of notes
> (350M) to prepare them for search and indexing (Elasticsearch and Solr). I
> have been in touch with Dr. Guergana Savova about cTAKES Scrubber and she
> has been very helpful.
>
> One of our most desired features in the de-identification pipeline is
> synthetic replacement (e.g. Nancy->Sally; random female first name
> consistently replaces a female first name.). I wasn't able to find
> information about this feature in cTAKES Scrubber. Is synthetic replacement
> functionality part of the cTAKES Scrubber, or can it be added by
> post-processing the output? For instance, if we know the name Nancy is
> removed from multiple places, can we use a name dictionary to insert random
> female first names in those places (just a thought)?
> Overall, I wanted to emphasize that cTAKES Scrubber is one of our main
> candidates and I'm hoping that we could find ways to collaborate.
>
> Thank you very much,
> Masoud
>
> 
> Masoud Rouhizadeh, PhD
> Faculty - Division of Health Science Informatics (DHSI)
> NLP Lead - Institute for Clinical and Translational Research (ICTR)
> Johns Hopkins University School of Medicine
> https://www.cs.jhu.edu/~mrou/
>
>

--
Regards,
Gandhi

"The best way to find urself is to lose urself in the service of others !!!"


Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

2019-07-17 Thread gandhi rajan
Thanks for the insight Sean.


-- 
Regards,
Gandhi

"The best way to find urself is to lose urself in the service of others !!!"


Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

2019-07-17 Thread Ravi Tejwani
How can I un-subscribe from this? Any help would be kindly appreciated. 

- Ravi



Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

2019-07-17 Thread Peter Szolovits
My group has done considerable work on de-identification and on synthesizing 
pseudonymous data to replace the original PHI with plausible but inauthentic 
data (sometimes confusingly called re-identification). 

One conclusion I reached from that work is that the de-identification and the 
pseudonym generation should be tightly coupled. For example, if de-id replaces 
all people’s names by [person], then the pseudonym generation has no way to 
make sure that the same real person’s name is replaced by the same pseudonym 
in every occurrence, which makes the text much harder to interpret.  
The same goes for other PHI categories.
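
To make the coupling concrete, here is a minimal Python sketch in which the 
replacement step still sees the original string of each detected span and can 
therefore reuse one surrogate per real name. It is an illustration only, not 
the code mentioned in this message; PseudonymMap, pseudonymize, the surrogate 
pool, and the regex that stands in for a de-id detector are all assumptions.

```python
import random
import re


class PseudonymMap:
    """Assign each distinct original value one surrogate and reuse it."""

    def __init__(self, pool, seed=None):
        self._rng = random.Random(seed)
        self._unused = list(pool)
        self._assigned = {}

    def surrogate_for(self, original):
        key = original.strip().lower()
        if key not in self._assigned:
            if not self._unused:
                raise ValueError("surrogate pool exhausted")
            # Take the choice out of the pool so that two different
            # real names never share a surrogate.
            choice = self._rng.choice(self._unused)
            self._unused.remove(choice)
            self._assigned[key] = choice
        return self._assigned[key]


def pseudonymize(text, spans, mapping):
    """Replace detected (start, end) spans with consistent surrogates."""
    out, cursor = [], 0
    for start, end in sorted(spans):
        out.append(text[cursor:start])
        out.append(mapping.surrogate_for(text[start:end]))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


note = "Nancy reports a rash. Nancy denies fever. Ellen drove her home."
# Stand-in for the de-id detector: spans of the names we pretend it found.
spans = [m.span() for m in re.finditer(r"Nancy|Ellen", note)]
print(pseudonymize(note, spans, PseudonymMap(["Sally", "Karen", "Joan"], seed=1)))
```

Once the names have been collapsed to a bare [person] tag, the lookup key used 
in surrogate_for is gone, which is the problem described above.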

I think it’s also important to keep similar formatting if the pseudonymized 
data are going to be used for NLP learning tasks.  So, for example, the format 
of names should be preserved; e.g., Smith, Joseph P. vs Joseph P. Smith. 
Nicknames are a problem as well; if the same document also refers to Joe, and 
the generated pseudonym for Mr. Smith is Robert J. Quincy, then the replacement 
for Joe should be Bob.  Gender is also tough because there are so many names 
that are either ambiguous or not in name dictionaries.
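
A minimal Python sketch of the format-preservation idea, handling only the 
two layouts in the example above. The regular expressions and the function 
name render_like are assumptions for illustration; nicknames (Joe -> Bob) and 
gender are not addressed here.

```python
import re

# Two layouts from the example above; any other layout is left unhandled.
LAST_FIRST = re.compile(
    r"^(?P<last>[A-Z][a-z]+), (?P<first>[A-Z][a-z]+)(?: (?P<mid>[A-Z])\.)?$"
)
FIRST_LAST = re.compile(
    r"^(?P<first>[A-Z][a-z]+)(?: (?P<mid>[A-Z])\.)? (?P<last>[A-Z][a-z]+)$"
)


def render_like(original, first, middle_initial, last):
    """Write the surrogate name in the same layout as the original mention."""
    m = LAST_FIRST.match(original)
    if m:
        # Emit a middle initial only if the original mention had one.
        middle = f" {middle_initial}." if m.group("mid") else ""
        return f"{last}, {first}{middle}"
    m = FIRST_LAST.match(original)
    if m:
        middle = f" {middle_initial}." if m.group("mid") else ""
        return f"{first}{middle} {last}"
    return None  # Unknown layout: the caller decides how to handle it.


print(render_like("Smith, Joseph P.", "Robert", "J", "Quincy"))
print(render_like("Joseph P. Smith", "Robert", "J", "Quincy"))
```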

Date shifting also introduces pseudonymization problems.  For example, a 
patient admitted on December 15 may have a note saying they are expected to be 
discharged right after Christmas. If the admission date is shifted, say to 
mid-January, then retaining the discharge expectation would imply a very long 
anticipated hospital stay.
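
The problem can be reproduced with a few lines of Python. This sketch assumes 
that dates appear in ISO format (YYYY-MM-DD), which real notes rarely 
guarantee, and the function name shift_dates is made up for the example.

```python
import re
from datetime import date, timedelta

# Assumption: dates are written as YYYY-MM-DD.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def shift_dates(text, days):
    """Shift every ISO-formatted date in the text by the same number of days."""

    def shifted(match):
        year, month, day = (int(g) for g in match.groups())
        # date() raises ValueError for impossible dates such as 2018-13-45.
        return (date(year, month, day) + timedelta(days=days)).isoformat()

    return ISO_DATE.sub(shifted, text)


note = "Admitted 2018-12-15. Expect discharge right after Christmas."
print(shift_dates(note, 31))
# Admitted 2019-01-15. Expect discharge right after Christmas.
```

The explicit date moves to mid-January, but the word "Christmas" is untouched, 
so the shifted note now implies a stay of almost a year.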

We published a paper on this topic:
https://link.springer.com/chapter/10.1007/978-3-319-23633-9_27

I also have some old Java code that deals with a few of these issues, and would 
be happy to share it with anyone interested, though it’s far from production 
quality and does not address all the issues we know of.

—Peter Szolovits


Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

2019-07-17 Thread Lingren, Todd
We had some similar work on de-id and "re-id".

The impact on performance for NER tasks was minimal.

https://academic.oup.com/jamia/article/20/1/84/2909298

For the PHI replacement task, we drew surrogates from data based on US Census 
distributions.

https://www.sciencedirect.com/science/article/pii/S1532046414000161
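
As a rough illustration of frequency-weighted replacement (not the method 
from the papers above), surrogates can be drawn with probability proportional 
to how common each name is. The counts below are made up for this example and 
are NOT census figures; sample_surname and the exclude parameter are 
assumptions.

```python
import random

# Made-up illustrative counts, NOT real census figures.
SURNAME_COUNTS = {"Smith": 50, "Johnson": 40, "Nguyen": 8, "Quincy": 2}


def sample_surname(rng, exclude=()):
    """Draw a surname with probability proportional to its count."""
    candidates = [(n, c) for n, c in SURNAME_COUNTS.items() if n not in exclude]
    names, counts = zip(*candidates)
    return rng.choices(names, weights=counts, k=1)[0]


rng = random.Random(0)
# Exclude the real surname so the surrogate cannot equal the original.
print(sample_surname(rng, exclude={"Smith"}))
```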



--

Todd Lingren, M.S.
Division of Biomedical Informatics
Cincinnati Children's Hospital
todd.ling...@cchmc.org
(513) 803-9032




Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

2019-07-17 Thread gandhi rajan
Hi Ravi,

Send out an email to dev-unsubscr...@ctakes.apache.org to un-subscribe.


-- 
Regards,
Gandhi

"The best way to find urself is to lose urself in the service of others !!!"


Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

2019-07-18 Thread Masoud Rouhizadeh
Thanks, everyone, for your great feedback. Very helpful insights! 

Here are a few comments and questions: 

(1) Peter: great paper! I agree that replacing the same real person’s name by 
the same pseudonym makes the text easier to interpret but on the other hand, 
wouldn't it make the de-identification less robust? I think if we pick a random 
pseudonym in each instance, it would be difficult to find the real name (in 
case it is missed by the de-id system) when it is surrounded by (lots of) 
pseudonyms. 

(2) Peter: I'd appreciate it if you could share your code. That would be 
helpful indeed. 

(3) Todd: in your work, did you replace the same real person’s name by the same 
pseudonym across the note, or did you assign a random name each time?  

(4) Date shifting can be complicated. In addition to the admission case that 
Peter pointed out, we would need to deal with consistency. Is shifting the 
dates by a random yet consistent number of days within a single note 
sufficient, or should we do this at the patient level? For instance, if some 
signs and symptoms were observed and reported 1 year before the diagnosis, this 
trajectory should be preserved. Age would be another issue. Some risk factors 
are age-specific.

(5) Does anyone have any thoughts on using metadata from structured fields 
(e.g. name, DOB, SSN, contact info) to help the note de-identification system? 
If the note de-id system is aware of the person's real name, we could make it 
more sensitive to that name, or if we know the street on which the person 
lives, we can pay more attention to that in the free text. Does any de-id tool 
use this information systematically? 
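
To illustrate the idea in (5), here is a minimal Python sketch that builds a 
matcher from a patient's known identifiers and scans the note for them. It is 
not taken from any existing de-id tool; the function names and sample values 
are assumptions, and a real system would also need to handle misspellings, 
nicknames, and reformatted values.

```python
import re


def known_phi_pattern(values):
    """Build one case-insensitive pattern from a patient's known identifiers."""
    cleaned = sorted(
        {v.strip() for v in values if v and v.strip()}, key=len, reverse=True
    )
    if not cleaned:
        return None
    # Escape each value and list longer values first, so that
    # "Nancy Smith" is matched as a whole before "Nancy" alone.
    alternation = "|".join(re.escape(v) for v in cleaned)
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)


def find_known_phi(text, values):
    """Return (start, end) spans where a known identifier occurs in the text."""
    pattern = known_phi_pattern(values)
    return [m.span() for m in pattern.finditer(text)] if pattern else []


text = "Pt Nancy Smith lives on Elm Street; nancy called."
for start, end in find_known_phi(text, ["Nancy", "Nancy Smith", "Elm Street"]):
    print(text[start:end])
```

Spans found this way could be merged with the output of the statistical de-id 
system, so that a name the model misses is still caught when it matches the 
structured record.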

Thank you all! 
Masoud


Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

2019-07-19 Thread Lingren, Todd
Hi Masoud,

The replacement was the same within a note, but not standardized across the 
complete record for a patient. Date shifting was also within a note, not across 
a record. This does not really matter for the NER task, and even more extensive 
time-series information extraction or prediction shouldn't be relying on PHI 
anyway.
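
The choice between note-level and patient-level shifting can be reduced to 
which identifier the offset is derived from. The following Python sketch is 
an illustration under assumptions, not the method used in our work: 
shift_days, the key, and the ID strings are hypothetical. It derives a stable 
offset from a keyed hash, so the same scope always gets the same shift without 
storing a lookup table.

```python
import hashlib
import hmac


def shift_days(secret_key, scope_id, max_days=365):
    """Derive a stable day offset in [1, max_days] for one scope.

    The scope is whatever should share an offset: a note ID for
    note-level shifting, or a patient ID for record-level shifting.
    """
    digest = hmac.new(secret_key, scope_id.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % max_days + 1


key = b"replace-with-a-real-secret"  # hypothetical key; keep it out of the output data
per_note = shift_days(key, "note:1001")
per_patient = shift_days(key, "patient:42")
print(per_note, per_patient)
```

With a patient-level scope, every note for that patient moves by the same 
number of days, so intervals between notes are preserved.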

One other point about addresses: we obfuscated the road type. For example, if 
the address said 123 Main Street, we would change that to 429 First Avenue, or 
something like that. And we wouldn't use Main Street (only Main 
Avenue/Road/Drive/Boulevard) in other replacements.
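
A simplified Python sketch of this kind of address replacement follows. It is 
a reading of the description above rather than our actual code: the street 
name pool, the address pattern, and the function name are assumptions, and it 
only guarantees that the surrogate road type differs from the original one.

```python
import random
import re

ROAD_TYPES = ["Street", "Avenue", "Road", "Drive", "Boulevard"]
STREET_NAMES = ["First", "Oak", "Maple", "Lake"]  # hypothetical surrogate pool

# Assumed address shape: house number, one capitalized word, road type.
ADDRESS = re.compile(r"\b(\d+) ([A-Z][a-z]+) (" + "|".join(ROAD_TYPES) + r")\b")


def obfuscate_addresses(text, rng):
    """Replace house number, street name, and road type of each address."""

    def replace(match):
        original_type = match.group(3)
        # Pick a road type other than the original, so the original
        # "name + type" combination cannot reappear as a surrogate.
        new_type = rng.choice([t for t in ROAD_TYPES if t != original_type])
        return f"{rng.randint(1, 999)} {rng.choice(STREET_NAMES)} {new_type}"

    return ADDRESS.sub(replace, text)


print(obfuscate_addresses("Lives at 123 Main Street.", random.Random(3)))
```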



--

Todd Lingren, M.S.
Division of Biomedical Informatics
Cincinnati Children's Hospital
todd.ling...@cchmc.org
(513) 803-9032



From: Masoud Rouhizadeh 
Sent: Thursday, July 18, 2019 12:27:41 PM
To: dev@ctakes.apache.org
Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

Thanks, everyone, for their great feedback. Very helpful insights!

Here are a few comments and questions:

(1) Peter: great paper! I agree that replacing the same real person’s name by 
the same pseudonym makes the text easier to interpret but on the other hand, 
wouldn't it make the de-identification less robust? I think if we pick a random 
pseudonym in each instance, it would be difficult to find the real name (in 
case it is missed by the de-id system) when it is surrounded by (lots of) 
pseudonyms.

(2) Peter: I'd appreciate if you could share your code. That would be helpful 
indeed.

(3) Todd: in your work, did you replace the same real person’s name by the same 
pseudonym across the note or you assigned a random name each time?

(4) Date shifting can be complicated. In addition to the admission case that 
Peter pointed out, we would need to deal with consistency. Will shifting the 
date by a random yet consistent number across that single note is sufficient or 
should we do this at the patient level? For instance, if some signs and 
symptoms observed and reported 1 year before the diagnosis, this trajectory 
should be preserved. Age would be another issue. Some risk factors are 
age-specific.

(5) Does anyone have any thoughts of using metadata from structured fields 
(e.g. name, DOB, SSN, contact info) to help the note de-identification system? 
if the note de-id system is aware of the person's real name, we could make it 
more sensitive to that name, or if we know the street in which the person 
lives, we can pay more attention to that in the free text. Just wondering if 
any de-id tool uses this information systematically?

Thank you all!
Masoud

On 7/17/19, 3:01 PM, "Lingren, Todd"  wrote:

We had some similar work on de-id and "re-id".

The impact on performance for NER tasks was minimal.

https://academic.oup.com/jamia/article/20/1/84/2909298

The PHI replacement task used data based on the US Census distribution.

https://www.sciencedirect.com/science/article/pii/S1532046414000161



--

Todd Lingren, M.S.
Division of Biomedical Informatics
Cincinnati Children's Hospital
todd.ling...@cchmc.org
(513) 803-9032



From: Peter Szolovits 
Sent: Wednesday, July 17, 2019 1:12:21 PM
To: dev@ctakes.apache.org
Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

My group has done considerable work on de-identification and on 
synthesizing pseudonymous data to replace the original PHI with plausible but 
inauthentic data (sometimes confusingly called re-identification).

One conclusion I reached from that work is that the de-identification and 
the pseudonym generation should be tightly coupled. For example, if de-id 
replaces all people’s names by [person], then there is no way in the pseudonym 
generation to make sure that the same real person’s name is replaced by the 
same pseudonym in every occurrence, leading to much harder to interpret text.  
The same goes for other PHI categories.
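
As a rough illustration of that coupling, the Python sketch below keeps the 
real-name-to-pseudonym mapping alive while replacements are made, so repeated 
mentions get the same substitute. The class name PseudonymMapper and the 
pseudonym pool are hypothetical. The scope of one mapper (one note, one 
patient, one corpus) is left to the caller.

```python
import random

class PseudonymMapper:
    """Assign each distinct real name one pseudonym and reuse it."""

    def __init__(self, pool, seed=None):
        self._pool = list(pool)
        self._rng = random.Random(seed)
        self._rng.shuffle(self._pool)
        self._assigned = {}

    def replace(self, real_name):
        # Normalize case and outer whitespace so repeated mentions share a key.
        key = real_name.strip().lower()
        if key not in self._assigned:
            if not self._pool:
                raise ValueError("pseudonym pool exhausted")
            # pop() guarantees two real names never share a pseudonym.
            self._assigned[key] = self._pool.pop()
        return self._assigned[key]

pool = ["Robert Quincy", "Alice Hart", "Maria Lopez"]  # hypothetical pool
mapper = PseudonymMapper(pool, seed=0)
first = mapper.replace("Joseph Smith")
again = mapper.replace("joseph smith")
other = mapper.replace("Jane Doe")
```

If de-id had already collapsed both names to [person], there would be no key 
left to look up, which is the problem described above.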

I think it’s also important to keep similar formatting if the pseudonymized 
data are going to be used for NLP learning tasks.  So, for example, the format 
of names should be preserved; e.g., Smith, Joseph P. vs Joseph P. Smith. 
Nicknames are a problem as well; if the same document also refers to Joe, and 
the generated pseudonym for Mr. Smith is Robert J. Quincy, then the replacement 
for Joe should be Bob.  Gender is also tough because there are so many names 
that are either ambiguous or not in name dictionaries.
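
A small Python sketch of the format-preservation point, under the assumption 
that only the two layouts mentioned above need to be told apart. The function 
render_like and its parameters are hypothetical. Nicknames and gender are not 
handled here.

```python
import re

# "Last, First ..." layout: some text, a comma, then at least one more token.
LAST_FIRST = re.compile(r"^\s*[^,]+,\s*\S+")

def render_like(original, first, middle_initial, last):
    """Render a pseudonym in the same name order as the original mention."""
    if LAST_FIRST.match(original):
        return f"{last}, {first} {middle_initial}."
    return f"{first} {middle_initial}. {last}"

a = render_like("Smith, Joseph P.", "Robert", "J", "Quincy")
b = render_like("Joseph P. Smith", "Robert", "J", "Quincy")
```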

Date shifting also introduces pseudonymization problems.  For example, a 
patient admitted on December 15 may have a note saying they are expected to be 
discharged right after Christmas. If the admission date is shifted, say to 
mid-January, then retaining the discharge expectation would imply a very long 
anticipated hospi

Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

2019-07-19 Thread Finan, Sean
Hi all,

Replacement consistency on the patient vs. note level may or may not be 
important depending upon the project.

For instance, if you need to place patients in a corpus on a timeline, then it 
is definitely necessary to be consistent across patients - not just consistent 
with names and dates across a patient, but also with unique names across a 
corpus if there are no other (deid'd) unique identifiers.  For other 
corpus-wide groupings, like diagnosis counts, correlation counts, etc., 
consistency is important.

I should also point out that time shifting severity can be very important 
depending upon the project.  For instance, headaches occur at any age, so if 
that is the study focus then a +/- 5 year shift may be ok.  However, a study 
related to something occurring during or around puberty may require a more 
narrow shift, say +/- 1 year.  Also consider date shift amounts for studies 
involving any drugs that would have been new or discontinued during the time 
studied, changes in diagnostic criteria, etc.  It may be necessary to use 
smaller shifts or maybe only shifts forward and not backward, etc.
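
One possible way to get a shift that is consistent per patient, bounded, and 
optionally forward-only is sketched below in Python. It derives the offset 
from a keyed hash of the patient id, so no lookup table is needed. The 
function names, the max_days parameter, and the secret key are all 
assumptions for illustration, not an existing tool's API.

```python
import hashlib
import hmac
from datetime import date, timedelta

def patient_offset_days(patient_id, secret, max_days=365, forward_only=False):
    """Derive a stable, non-zero day offset for one patient from a keyed hash."""
    digest = hmac.new(secret, patient_id.encode("utf-8"), hashlib.sha256).digest()
    n = int.from_bytes(digest[:8], "big")
    if forward_only:
        return 1 + n % max_days              # 1 .. max_days
    offset = n % (2 * max_days) - max_days   # -max_days .. max_days - 1
    return offset if offset != 0 else max_days

def shift(d, offset_days):
    """Shift a date by the patient's offset."""
    return d + timedelta(days=offset_days)

secret = b"example-key"  # hypothetical key, kept outside the released data
off = patient_offset_days("patient-001", secret, max_days=365)
symptoms = date(2018, 3, 1)
diagnosis = date(2019, 3, 1)
```

Because every date for a patient moves by the same number of days, intervals 
such as "symptoms one year before diagnosis" are preserved. A puberty study 
might pass a smaller max_days, and a drug-availability study might set 
forward_only=True. This does not address seasonal references in the text, 
such as a discharge expected "right after Christmas".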

Lastly, something (like an internal quality control run?) -could- even require 
physician deids to be consistent across the entire corpus (not just 
per-patient).   This is a really special case, but it could happen.

At any rate, the point is just that deid should not be viewed as simple from 
any angle.

Sean

From: Lingren, Todd 
Sent: Friday, July 19, 2019 10:27 AM
To: dev@ctakes.apache.org
Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

Hi Masoud,

The replacement was the same within a note, but not standardized across the 
complete record for a patient. Date shifting was also within a note, not across 
a record. The NER task doesn't really matter in this regard, and even for more 
extensive time-series info extraction/prediction, that shouldn't be relying on 
PHI anyway.

One other point about addresses: we obfuscated the road type. For example, if 
the address said 123 Main Street, we would change that to 429 First Avenue, or 
something like that. And we wouldn't use Main Street (only Main 
Avenue/Road/Drive/Boulevard) in other replacements.
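
One reading of that rule, sketched in Python: a synthetic address never 
reproduces a (street name, road type) pair that occurs in the real data. The 
name pools, the house-number range, and the function synthetic_address are 
hypothetical and are not taken from the published system.

```python
import random

STREET_NAMES = ["Main", "First", "Oak", "Maple"]  # hypothetical pools
ROAD_TYPES = ["Street", "Avenue", "Road", "Drive", "Boulevard"]

def synthetic_address(real_pairs, rng):
    """Pick a house number, street name, and road type whose (name, type)
    pair does not occur among the real addresses."""
    allowed = [(n, t) for n in STREET_NAMES for t in ROAD_TYPES
               if (n, t) not in real_pairs]
    if not allowed:
        raise ValueError("no synthetic street available")
    name, road_type = rng.choice(allowed)
    return f"{rng.randint(100, 999)} {name} {road_type}"

real_pairs = {("Main", "Street")}  # pairs observed in the original notes
rng = random.Random(0)
samples = [synthetic_address(real_pairs, rng) for _ in range(200)]
```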



--

Todd Lingren, M.S.
Division of Biomedical Informatics
Cincinnati Children's Hospital
todd.ling...@cchmc.org
(513) 803-9032




Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

2019-07-23 Thread Masoud Rouhizadeh
Hi all,

Thank you so much for your great feedback. I've learned a lot from these 
real-world, hands-on research insights on de-id. 

This is clearly not very related to cTAKES anymore, and I don't want to spam 
the cTAKES dev mailing list. Is there any other mailing list where we could 
have these types of discussions? 

Thanks,
Masoud

