I can't thank you enough! Kind Regards,
Eugenia

-----Original Message-----
From: Finan, Sean <sean.fi...@childrens.harvard.edu>
Sent: 09 December 2020 16:49
To: dev@ctakes.apache.org
Subject: Re: Disambiguation --alignment with SNOMED [EXTERNAL]

Hi Eugenia,

I think that I actually have code scattered about that can help with a lot of this. It isn't checked in, and I will need to shove some things around to make it fully ctakes-compatible. I can't do anything right now, but since this seems to be pretty urgent for you, I will start putting things together after work.

Sean

________________________________________
From: Monogyiou, Eugenia <eugenia.monogy...@nttdata.com>
Sent: Wednesday, December 9, 2020 11:32 AM
To: dev@ctakes.apache.org
Subject: RE: Disambiguation --alignment with SNOMED [EXTERNAL]

Many thanks for the support, Sean. Let me explain myself a bit.

Ideally I would need all the key entities (Medications, Diseases, Sign/Symptom, Procedures and Labs) extracted from a set of clinical letters. These letters are not clinical notes: they have paragraphs with titles that tell you what type of content to expect in the paragraph, but there is no timeline of events described in the letter and you cannot associate entities or events via cause-and-effect relationships. They refer to heart attack/heart disease cohorts only.

The absolute priority at the moment is Medications and Diseases. It would be great if I could get the medication dosages, which, when actually described, are always in different formats, and 99% of these formats cannot be found in the measurement files in the source code.

The other key concern is disambiguation. Since I can't get YTEX to work, I am thinking of other ways to apply disambiguation (e.g. lexicon expansion, pre-processing of certain acronyms, etc.).

Lastly, but very importantly, I need to create an evaluation process against a gold standard dataset. Are there templates or code included in the trunk that do evaluation?
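As an illustration of the kind of evaluation asked about above, here is a minimal sketch in Python. It is not cTAKES code and not something from the trunk: the `Span` type, the `precision_recall_f1` function and the exact-match criterion (same offsets and same entity type) are all assumptions made for the example. A real evaluation might prefer overlap matching or per-type scores.

```python
from typing import Iterable, NamedTuple, Tuple


class Span(NamedTuple):
    """Hypothetical annotation record: character offsets plus an entity type."""
    begin: int
    end: int
    label: str


def precision_recall_f1(gold: Iterable[Span],
                        predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Score predictions against gold using exact span-and-label matches."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up offsets purely for demonstration.
    gold = [Span(0, 7, "Medication"), Span(10, 20, "Disease")]
    predicted = [Span(0, 7, "Medication"), Span(30, 35, "Disease")]
    print(precision_recall_f1(gold, predicted))
```

Exact matching is the strictest choice. If the gold standard was annotated with slightly different span boundaries than the pipeline produces, a relaxed overlap criterion would probably be fairer.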
Any feedback is greatly appreciated :)

Kind Regards,
Eugenia

-----Original Message-----
From: Finan, Sean <sean.fi...@childrens.harvard.edu>
Sent: 09 December 2020 14:23
To: dev@ctakes.apache.org
Subject: Re: Disambiguation --alignment with SNOMED [EXTERNAL]

Hi Eugenia,

I don't know that anybody on the devlist regularly uses the org.hsqldb.util.DatabaseManager tool, and there might be a better forum for questions on that topic.

We could take a step back here and see if there might be more direct ways to address your efforts. By that I mean perhaps we can look at the larger picture and come up with some single solution to many smaller problems.

What exactly are you trying to extract from your documents? Do you have a certain clinical domain or certain clinical elements that interest you the most? Are you only interested in entities from a single vocabulary?

Thanks,
Sean

________________________________________
From: Monogyiou, Eugenia <eugenia.monogy...@nttdata.com>
Sent: Wednesday, December 9, 2020 5:52 AM
To: dev@ctakes.apache.org
Subject: RE: Disambiguation --alignment with SNOMED [EXTERNAL]

Many thanks for the suggestion. Before using the sample tool I tried the hsqldb manager, and the results were surprising. Please bear with me because I am really confused...

I copied the hsqldb jar to where my dict script and properties files are, navigated there, and ran the following command:

    java -cp hsqldb-2.3.4.jar org.hsqldb.util.DatabaseManager

The GUI launched successfully and the user was set to SA. I then set the URL to

    jdbc:hsqldb:\apache-ctakes-4.0.0\resources\org\apache\ctakes\dictionary\lookup\fast\sno_rx_16ab\sno_rx_16ab

and the connection was established successfully. However, the tree is empty! No schema, no tables. If I open the script file, all the relevant SQL commands are there to create the cui_terms, prefterm tables etc. Any idea what I am doing wrong?
Is it possible that my installation was problematic, or is it a matter of configuration? I have used many pipelines and I am getting many annotations from different coding schemes across all the key entities. I can even see on the command line when the hsqldb is accessed when I run the pipelines, so I must be missing something here.

Thank you for your patience with me!

Kind Regards,

Eugenia Monogyiou | NTT Data UK
Consulting & IT Solutions Ltd. 1 Royal Exchange, London EC3V 3DG
Mob: +44 (0)7971623683 Email: eugenia.monogy...@nttdata.com

-----Original Message-----
From: Finan, Sean <sean.fi...@childrens.harvard.edu>
Sent: 08 December 2020 19:23
To: dev@ctakes.apache.org
Subject: Re: Disambiguation --alignment with SNOMED [EXTERNAL]

Hi Eugenia,

Within the past few years people have made some user-friendly tools. Check out:
https://razorsql.com/features/hsqldb_gui_tools.html

After you launch it (30 day trial is free), create a new connection as the first panel indicates.
- Type any name for the connection (line #1)
- For the Administrator, type "SA"
- Point to the sno_rx_16ab.script file (or better yet, a copy that you can play with)

You are done with those settings. Opening the db will take a few seconds.
In the new main panel, in the tree on the left, select Project > PUBLIC > Tables
You should see the tables that are relevant to ctakes.
Right-click on one of the tables and select "Edit"
A panel should pop up with the table. You should be able to edit the columns and rows.

There is also a gui that the hsqldb people make, but it is a little primitive.
http://www.hsqldb.org/doc/2.0/guide/running-chapt.html#rgc_access_tools

You can find instructions online such as:
https://waqasaslam.me/2019/06/24/how-to-view-hsql-db-in-a-gui-hsql-database-manager-for-sap-hybris/

Sean

________________________________________
From: Monogyiou, Eugenia <eugenia.monogy...@nttdata.com>
Sent: Tuesday, December 8, 2020 1:08 PM
To: dev@ctakes.apache.org
Subject: RE: Disambiguation --alignment with SNOMED [EXTERNAL]

Many thanks for your support thus far. I am finally shaping a decent pipeline for the data I am working on -- Chunk in the overlap annotator actually works great!

I have never tried to access the hsqldb before, so I will have to ask for instructions on that please (signal processing person here)! I looked up some advice from Sean, i.e. to copy the hsql***.jar from [ctakes_root]/lib/, navigate to the cTakes\apache-ctakes-4.0.0\resources\org\apache\ctakes\dictionary\lookup\fast\sno_rx_16ab folder on the command line and type

    java -cp hsqldb-2.3.4.jar org.hsqldb.util.SqlTool --rcfile [sno_rx_16ab].rc [sno_rx_16ab]

to open the SqlTool. The sno_rx_16ab dictionary was created with the Dictionary Creator GUI of ctakes 4.0.0. The problem is that there is no .rc file there; there are only .script, .properties and .lck files. I know the .script file contains the SQL commands, but I prefer not to improvise and change something. Any instructions on how I can inspect the tables below in hsqldb please?
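Since the .script file is plain text containing the SQL statements, one read-only way to look at the dictionary without any HSQLDB tool is to scan that file directly. The following is a minimal Python sketch under that assumption. The function names `count_rows` and `rows_mentioning` are made up for the example, and it assumes each row is stored as a single `INSERT INTO <TABLE> VALUES(...)` line, as in the rows quoted elsewhere in this thread.

```python
import re
import sys
from collections import Counter
from typing import Iterable, List

# Matches lines such as: INSERT INTO CUI_TERMS VALUES(...)
# An optional schema prefix (e.g. PUBLIC.) is tolerated.
INSERT_RE = re.compile(r"^INSERT INTO (?:\w+\.)?(\w+) VALUES", re.IGNORECASE)


def count_rows(lines: Iterable[str]) -> Counter:
    """Count INSERT statements per table name."""
    counts: Counter = Counter()
    for line in lines:
        match = INSERT_RE.match(line)
        if match:
            counts[match.group(1).upper()] += 1
    return counts


def rows_mentioning(lines: Iterable[str], term: str) -> List[str]:
    """Return INSERT lines whose text contains `term` (case-insensitive)."""
    needle = term.lower()
    return [line.rstrip("\n") for line in lines
            if INSERT_RE.match(line) and needle in line.lower()]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (path is whatever your copy of the dictionary script is called):
    #   python inspect_script.py sno_rx_16ab.script monocyte
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        content = handle.readlines()
    print(count_rows(content))
    if len(sys.argv) > 2:
        for row in rows_mentioning(content, sys.argv[2]):
            print(row)
```

Because the script only reads the file, it cannot alter the dictionary. Working on a copy, as Sean suggests for the GUI tools, is still the safer habit.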
Thank you!

Kind Regards,
Eugenia

-----Original Message-----
From: Peter Abramowitsch <pabramowit...@gmail.com>
Sent: 04 December 2020 20:03
To: dev@ctakes.apache.org
Subject: Re: Disambiguation --alignment with SNOMED

Hi Eugenia.

I may be wrong, but that XML definition is out of date (which is why it is commented out). Through the piper mechanism you have a different choice. Here follows a bit more. I hope some of it is useful....

Highly specific identification of terms is difficult, and I am working on some infrastructure to help in really capturing values, not only lab values, but it will take a long time as I'm just doing it for fun. But your problem seems more like a dictionary issue. I won't pretend to be an expert or to have tried out every possibility, but I'll give you a few tips.

The important thing is to know that, for me at least, Ctakes is not a finished product but an eternal work in progress. It takes years of experimentation and configuration.

First you need to understand what specific terms and contexts your physicians are using, and whether the punctuation is clean enough that you can work with sentences or need to go down to the chunk level.

In the UMLS Dictionary Lookup mechanism, the WindowAnnotation param is probably something you can supply in a piper file, and it is the FQN of a class that extends Annotation. You could create your own Annotation & Annotator, or you could try using a Chunk annotator upstream of the UMLS lookup. The piper creator helps you do that. Then you would add the FQN of a Chunk to the window param of your UMLS lookup annotator. I used it a long time ago, and from what I remember it basically tries to identify clauses within sentences. By doing this, especially with the Overlap Annotator, you'd prevent spilling the lookup across clauses within a sentence.

You may want to play with the SentenceDetectorAnnotatorBIO instead of the SentenceDetector to see which gets you the most workable sentences.
And you may want to look at this file: EndOfSentenceScannerImpl.java

Customizing the dictionary usually means adding a synonym for each wording that represents a context in which your term will be found.

Now, in your specific example about a monocyte procedure vs a monocyte count result, these are not just distinct in SNOMED terms but also distinct CUIs. Here are the two canonical terms with their CUIs as I found them, then each has its synonyms. As you can see, these SYNONYMS are woefully insufficient, and not only have the synonyms blurred the distinction you were looking for, but the SNOMED mapping overlaps the two concepts. This was probably done as an expedient, but from an informatics perspective, you are right. This is incorrect.

    INSERT INTO PREFTERM VALUES(750880,'Monocyte count result')      (TUI 34)
        SYNONYMS count monocytes, SNO *365631001*
    INSERT INTO PREFTERM VALUES(200637,'Monocyte count procedure')   (TUI 59)
        SYNONYM monos, monocyte count SNO 67776007, *365631001*

Check out how a row like this works.

    INSERT INTO CUI_TERMS VALUES(CUI,INDEX,COUNT,'<context with keyword>','<keyword>')

You can add these rows to match the language used by your physicians or in your forms. I had to do a fair bit of juggling to get what we needed, and it's a job that's never finished. The way I save my changes is to produce sed files of the deletions, changes and additions made to the standard dictionary, and archive those rather than the dictionary, which is quite large.

I hope this helps.
Peter

On Fri, Dec 4, 2020 at 5:07 PM Monogyiou, Eugenia <eugenia.monogy...@nttdata.com> wrote:

> Thank you all for the support!
> Sean, Kean the labValueFinder works as described so thanks for
> pointing that out!
>
> Peter, I will ask for your help with the LookupWindow if you could
> please spare a bit more time... I have located the
> UmlsOverlapLookupAnnotator file, thank you for that.
>
> I have located in the UmlsLookupAnnotator (in
> ctakes-dictionary-lookup-fast)
> <name>windowAnnotations</name>
> <value>
>     <!-- LookupWindowAnnotation is supposed to be a refined Noun Phrase -->
>     <!--<string>org.apache.ctakes.typesystem.type.textspan.LookupWindowAnnotation</string>-->
>     <!-- In some instances LookupWindowAnnotation is missing tokens and Sentence can be used -->
>     <string>org.apache.ctakes.typesystem.type.textspan.Sentence</string>
> </value>
>
> I have gone through various java and typesystem files but I am not
> sure where I can find all the potential options for the Lookup Window
> and where/how I can set these. Also, if you could please let me know
> where in the code it is possible to see what symbols are considered
> "end-of sentence". I have noticed that ":" sometimes defines the end
> of a sentence but I haven't located anything relevant in the code ...
>
> Peter says :
>
> > Sometimes you need to further customize your dictionary.
> (can you please elaborate ?)
>
> Many thanks in advance,
>
> Kind Regards,
>
> Eugenia Monogyiou | NTT Data UK
> Consulting & IT Solutions Ltd. 1 Royal Exchange, London EC3V 3DG
>
> Mob: +44 (0)7971623683 Email: eugenia.monogy...@nttdata.com
>
>
> -----Original Message-----
> From: Peter Abramowitsch <pabramowit...@gmail.com>
> Sent: 03 December 2020 18:54
> To: dev@ctakes.apache.org
> Subject: Re: Disambiguation --alignment with SNOMED
>
> I have this issue a lot. There are many moving parts. Sometimes it can
> be resolved by using the widest window in the DictionaryLookup or
> sometimes the TermOverlap lookup annotator. Sometimes you need to
> further customize your dictionary.
>
> The problem arises when there isn't enough context to whittle down the
> lookup to the correct SNOMED entity. Or there isn't a synonym entry in the
> Dictionary that maps to the widest context in your texts. If you look at
> how the UMLS SNO_RX dictionary is structured you'll see how it can happen.
>
> For starters, look at the raw XMI and see all the entries in the
> UmlsArray that were selected even if later, only the wrong one entry surfaced.
>
> Another issue is the LabValueFinder. It has settings that allow it to
> clone procedures into lab values or vice versa (I can't remember).
> This can lead to a lot of duplication
>
> Peter
>
> On Thu, Dec 3, 2020 at 2:23 PM Monogyiou, Eugenia <eugenia.monogy...@nttdata.com> wrote:
>
> > Hello,
> >
> > I think I have hit a wall in terms of applying disambiguation in the
> > cTakes context. I have come across the following example where what
> > I consider to be a lab result (Monocyte Count) is picked up as a
> > procedure, apparently, in alignment with UMLS
> > coding Scheme = SNOMED Code =67776007, CUI =C0200637 , TUI =T059 , preferredText = " Monocyte Count Procedure"
> > coding Scheme = SNOMED Code =365631001, CUI =C0200637 , TUI =T059 , preferredText = " Monocyte Count Procedure"
> >
> > While they share the CUI (at UMLS level, due to the reconciliation
> > of different ontologies), they are quite different concepts.
> > 67776007 stands for "Monocyte count (procedure)" while 365631001
> > stands for "Finding of monocyte count (finding)". So is it fair to
> > say that cTakes is not fully aligned with SNOMED? Is there a rule
> > on how such concepts may be merged under the same CUI? Would using
> > YTEX resolve similar issues?
> >
> > And also I'm using cTakes 4.0.0 and the YTEX installation guide
> > appears to be outdated - the patch download is missing , names of
> > files missing etc.
> > If YTEX is the answer are there any updated instructions? If it is
> > not are you using other UIMA-friendly solutions?
> >
> > Many thanks in advance,
> > Eugenia
> >
> > Disclaimer: This email and any attachments are sent in strictest
> > confidence for the sole use of the addressee and may contain legally
> > privileged, confidential, and proprietary data.
> > If you are not the intended recipient, please advise the sender by replying promptly to this email and then delete and destroy this email and any attachments without any further use, copying or forwarding.
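
Returning to Peter's note on the CUI_TERMS row format, INSERT INTO CUI_TERMS VALUES(CUI,INDEX,COUNT,'<context with keyword>','<keyword>'), the sketch below shows one way such rows could be generated for new synonyms. It rests on several assumptions that are not confirmed anywhere in this thread: that INDEX is the zero-based position of the keyword among the whitespace-separated tokens, that COUNT is the number of tokens, and that the text is stored in lower case. The helper name `cui_terms_row` is hypothetical. Compare the output against existing rows in your own .script file before relying on it.

```python
def sql_quote(value: str) -> str:
    """Wrap a string in single quotes, doubling any embedded quotes."""
    return "'" + value.replace("'", "''") + "'"


def cui_terms_row(cui: int, text: str, keyword: str) -> str:
    """Build a CUI_TERMS INSERT statement for one synonym.

    ASSUMPTIONS (verify against existing dictionary rows):
      - tokens are split on whitespace and lower-cased
      - INDEX is the zero-based position of `keyword` in the tokens
      - COUNT is the total number of tokens
    """
    tokens = text.lower().split()
    key = keyword.lower()
    if key not in tokens:
        raise ValueError(f"keyword {keyword!r} is not a token of {text!r}")
    index = tokens.index(key)
    return ("INSERT INTO CUI_TERMS VALUES("
            f"{cui},{index},{len(tokens)},"
            f"{sql_quote(' '.join(tokens))},{sql_quote(key)})")


if __name__ == "__main__":
    # The CUI is taken from the PREFTERM rows quoted in the thread;
    # the synonym wording is an invented example.
    print(cui_terms_row(750880, "monocyte count result", "monocyte"))
```

Generated rows could be kept in a separate file and appended to a copy of the dictionary, in the same spirit as Peter's approach of archiving the changes rather than the whole dictionary.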