Johnsd11 commented on issue #56: URL: https://github.com/apache/ctakes/issues/56#issuecomment-2835965082
> Out of the box, cTakes would get you part of the way there, but would require several types of customization to meet your requirements. All of these are the kind of customizations that most of us have had to do, so there's nothing new here, but they are not trivial. As I see it they fall into these categories:
>
> 1. getting familiar with the cTakes application, pipeline, annotator and vocabulary ecosystem
> 2. choosing a vocabulary subset that gives the best coverage of the terms you are looking for
> 3. adding one or more custom dictionaries to add terms & synonyms that are not present
> 4. maybe employing the anatomical site annotator in your pipeline
> 5. deciding how to harvest and structure the data you extract from the CAS object, which all the annotators target
> 6. deciding how to deploy the application (standalone? web services host? multi-instance?). Many considerations go into this and greatly affect the ability to scale. There is more than one architectural solution that will work and allow you to get to your "fully automated" goal, but you will need to implement it yourself.
>
> A hint about highlighting the text: all annotations carry text offsets, so with these you can write code (usually JS and CSS) to do your highlighting. Native cTakes does not have any graphical display functionality.
>
> Another hint learned from experience: if you have many large texts (say, 20 kB and above, with lots of potential terms to discover), you can achieve much better throughput by breaking them into smaller chunks at sentence boundaries and tweaking offsets accordingly as you reassemble the chunks. The memory requirements grow rapidly with the size of the note.
>
> In summary, a strong developer background is a good starting point. To that you'd want to add medical informatics and experience with scalable architectures. cTakes is a great kernel for your system, but be prepared to dive deep.
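The offset-based highlighting hint above can be sketched roughly as follows. The `begin`/`end` fields mirror the character offsets that cTAKES annotations carry; the `Annotation` shape and the `highlight` function are illustrative, not a cTAKES API, and the sketch assumes non-overlapping spans:

```typescript
// Illustrative annotation shape: begin/end character offsets into the note text.
interface Annotation {
  begin: number;
  end: number;
}

// Wrap each annotated span in <mark> tags for display.
// Assumes spans do not overlap; sorts by begin so text can be
// copied left to right in a single pass.
function highlight(text: string, anns: Annotation[]): string {
  const sorted = [...anns].sort((a, b) => a.begin - b.begin);
  let out = "";
  let pos = 0;
  for (const a of sorted) {
    out += text.slice(pos, a.begin);          // unannotated text before the span
    out += `<mark>${text.slice(a.begin, a.end)}</mark>`;
    pos = a.end;
  }
  return out + text.slice(pos);               // trailing unannotated text
}
```

CSS on the `<mark>` elements (or a class attribute added in the template string) then controls the visual styling.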
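The chunking hint can be sketched as below: split at sentence boundaries, annotate each chunk, then shift the chunk-local offsets back into document coordinates on reassembly. `processChunk` is a hypothetical stand-in for a call into a cTAKES pipeline, and the sentence splitter is a deliberately naive regex:

```typescript
// Illustrative annotation shape: begin/end character offsets.
interface Annotation {
  begin: number;
  end: number;
}

// Naive sentence splitter: each chunk remembers its offset in the
// original document so results can be re-based later.
function splitSentences(text: string): { text: string; offset: number }[] {
  const chunks: { text: string; offset: number }[] = [];
  const re = /[^.!?]+[.!?]?\s*/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    chunks.push({ text: m[0], offset: m.index });
  }
  return chunks;
}

// Run the (hypothetical) per-chunk annotator, then shift each
// chunk-local offset back into document coordinates.
function annotateDocument(
  text: string,
  processChunk: (chunk: string) => Annotation[]
): Annotation[] {
  const all: Annotation[] = [];
  for (const { text: chunk, offset } of splitSentences(text)) {
    for (const a of processChunk(chunk)) {
      all.push({ begin: a.begin + offset, end: a.end + offset });
    }
  }
  return all;
}
```

Since each chunk is annotated independently, this also opens the door to processing chunks in parallel, which is where the throughput gain described above comes from.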
Thanks for the detailed and thoughtful explanation. The easiest part for me to understand and work through would be #6. My MO for this sort of thing, and what the existing target system currently uses, is a combination of Windows services with associated DB queues and DLLs called from the application: the former for items that are not needed as part of the "real time" application, the latter for those that are.

I currently have a homegrown application that looks for keywords and negation modifiers within a certain distance of the keywords, which works moderately well. My ignorance regarding NLP systems like cTakes is whether it is keyword-driven or self-learning. If it is the latter, I have a fairly large collection of human-curated data which I could feed to a training module.

Where can I find an "executive overview" (30,000-foot view) of how cTakes works?
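For reference, the homegrown keyword-plus-negation approach described above can be sketched as follows. All names, the cue list, and the window size are illustrative assumptions, not anything from the actual application:

```typescript
// Simple negation cues; real systems (e.g. NegEx-style algorithms) use
// much richer trigger lists and directionality rules.
const NEGATIONS = ["no", "not", "denies", "without", "negative for"];

interface Match {
  term: string;
  start: number;    // character offset of the keyword in the text
  negated: boolean; // a negation cue appeared within the look-back window
}

// Find each keyword occurrence and check a fixed-size window of
// preceding text for a negation cue (word-bounded to avoid matching
// "no" inside "noted", etc.).
function findMatches(text: string, keywords: string[], window = 40): Match[] {
  const lower = text.toLowerCase();
  const matches: Match[] = [];
  for (const kw of keywords) {
    const needle = kw.toLowerCase();
    let idx = lower.indexOf(needle);
    while (idx !== -1) {
      const context = lower.slice(Math.max(0, idx - window), idx);
      const negated = NEGATIONS.some(n => new RegExp(`\\b${n}\\b`).test(context));
      matches.push({ term: kw, start: idx, negated });
      idx = lower.indexOf(needle, idx + 1);
    }
  }
  return matches;
}
```

This is the keyword-driven end of the spectrum; cTAKES combines dictionary lookup against clinical vocabularies with trained components (sentence detection, part-of-speech tagging, etc.) rather than being purely one or the other.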
