Britt and I (cTAKES committers) work on exactly this problem.
We have used cTAKES to train models for HIPAA de-identification.

In a nutshell: the answer depends on what the IRB considers "de-identified".
Hashing is not allowed by any IRB that I am aware of. 

On May 18, 2013, at 12:43 PM, Alexander Measure <[email protected]> wrote:

> In my day job I train text classifiers that are useful for a wide variety
> of health surveillance tasks. The data used to train these classifiers
> however cannot be shared because of confidentiality protections.  I would
> like to make these trained models available to others just as cTAKES does,
> but I'm not sure how. Can you tell me how cTAKES does it, or point me to
> resources that might be useful?
> 
> My models tend to be regularized logistic regression models trained on
> bag-of-words type features. I suspect that I can get some protection by
> hashing everything to a fixed space first, but if there's a different
> well-established approach out there I'd rather use that.
> 
> Alex Measure
