Yes, one batch! We had no problems with 10K plain text files.

On Tue, Jan 29, 2019 at 3:33 PM Baas,Leah <[email protected]> wrote:
> Ah, I see. Yes, I will change the pre-processing step to write plaintext
> instead of xml files. Thank you so much for the tip!
>
> Once I’ve fixed the pre-processing code, do you anticipate that I should
> be able to process all of the input files in one batch?
>
> Leah
>
> From: "Miller, Timothy" <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Tuesday, January 29, 2019 at 3:23 PM
> To: "Baas,Leah" <[email protected]>, "[email protected]" <[email protected]>
> Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
>
> OK, if you can see xml tags in the right pane, that means that ctakes is
> trying to process the xml markup as well as the text. Can you change your
> python pre-process to just write plaintext files with only the text from
> the note, and not xml? And then process that? I think there are probably
> cases where having xml in the text would confuse some of the modules and
> cause them to run slowly. You also will get weird outputs, I've seen
> "<span>" get annotated as a "body measurement finding" when we accidentally
> processed some html once.
>
> Tim
>
> -----Original Message-----
> From: "Baas,Leah" <[email protected]>
> To: "Miller, Timothy" <[email protected]>, [email protected] <[email protected]>
> Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
> Date: Tue, 29 Jan 2019 21:15:54 +0000
>
> Yes, I’ve been following those instructions to view the .xmi files in the
> CVD. The right pane shows the text of the XML file.
>
> Leah
>
> From: "Miller, Timothy" <[email protected]>
> Date: Tuesday, January 29, 2019 at 3:00 PM
> To: "Baas,Leah" <[email protected]>, "[email protected]" <[email protected]>
> Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
>
> So after you process all the notes do you follow the instructions on the
> wiki page that say:
>
> You can view information in the XMI files using the UIMA Cas Visual
> Debugger (CVD).
>
> Execute bin/runctakesCVD
> Select File > Read Type System File
> Select TypeSystem.xml in resources/org/apache/ctakes/typesystem/types/
> Select File > Read XMI CAS File
> Select any .xmi file in your outputDirectory
>
> and look at that .xmi file? If so, what do you see in the right pane? The
> text of the note or the text of an xml file?
>
> Tim
>
> -----Original Message-----
> From: "Baas,Leah" <[email protected]>
> To: "Miller, Timothy" <[email protected]>, [email protected] <[email protected]>
> Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
> Date: Tue, 29 Jan 2019 20:45:58 +0000
>
> It is not CDA format. I used Python’s ElementTree module to generate XML
> files containing the clinical notes for each subject in my dataset. When I
> run the Default Clinical Pipeline, I can successfully generate XMI output
> files for each XML file in my input directory. The following WARNING
> message appears multiple times over the course of the processing (not sure
> if this is at all related to the issue at hand):
>
> Jan 29, 2019 2:02:56 PM org.apache.uima.util.MessageReport decreasingWithTrace(51)
> WARNING: Message count: 1; Feature
> org.apache.ctakes.typesystem.type.textsem.Predicate:relations is marked
> multipleReferencesAllowed=false, but it has multiple references.
> These will be serialized in duplicate. Message count indicates messages
> skipped to avoid potential flooding. Set FINE logging level for stacktrace.
>
> Leah
>
> From: "Miller, Timothy" <[email protected]>
> Date: Tuesday, January 29, 2019 at 2:28 PM
> To: "Baas,Leah" <[email protected]>, "[email protected]" <[email protected]>
> Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
>
> Well if you're processing XML files that will likely cause a problem with
> this script, it's expecting plain text files in a directory. Maybe Sean can
> chime in on whether it's possible to use an XML collection reader with the
> runClinicalPipeline.sh script? Is it CDA format?
>
> Tim
>
> -----Original Message-----
> From: "Baas,Leah" <[email protected]>
> To: "Miller, Timothy" <[email protected]>, [email protected] <[email protected]>
> Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
> Date: Tue, 29 Jan 2019 20:21:17 +0000
>
> Hi Tim,
>
> Thanks again for working through this with me. I hadn’t read through the
> time stamps carefully enough to notice the one-time cost of startup.
>
> I did replicate your setup by copying/pasting 7 of my XML input files into
> an empty directory. Here’s what I saw:
>
> 1. For the startup: 20 seconds between the first time-stamped log message:
>
> 29 Jan 2019 14:02:35 INFO SentenceDetector - Sentence detector model file: org/apache/ctakes/core/sentdetect/sd-med-model.zip
>
> and the first log message doing processing:
>
> 29 Jan 2019 14:02:55 INFO SentenceDetector - Starting processing.
>
> 2. Once started up, 12 seconds to process the notes.
>
> 29 Jan 2019 14:03:07 INFO ClearNLPSemanticRoleLabelerAE - Finished processing
>
> Does this help narrow things down?
>
> Leah
>
> From: "Miller, Timothy" <[email protected]>
> Date: Tuesday, January 29, 2019 at 1:58 PM
> To: "Baas,Leah" <[email protected]>, "[email protected]" <[email protected]>
> Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
>
> I haven't used that script myself, but I just tried it now on some notes
> from mtsamples. Maybe you can try to replicate that setup? I just
> copy/pasted the 7 allergy/immunology notes [1] into 7 text files in an
> empty directory. Here's what I see:
>
> 1) It is pretty slow to start up -- but this is a one time cost (~50
> seconds). I'm looking at the time between the very first time-stamped log
> message:
>
> 29 Jan 2019 14:51:51 INFO SentenceDetector - Sentence detector model file: org/apache/ctakes/core/sentdetect/sd-med-model.zip
>
> and the first log message doing processing:
>
> 29 Jan 2019 14:52:40 INFO SentenceDetector - Starting processing
>
> 2) Once started up, it processes the notes in about 14s. This is actually
> slower than expected but this is a lot faster than you were seeing. I'm
> looking at the time between the start of processing just above and the last
> log message before it quits:
>
> 29 Jan 2019 14:52:54 INFO ClearNLPSemanticRoleLabelerAE - Finished processing
>
> If you can replicate this input/output setup and approximate timing in
> your VM first, then we can see whether it's a function of your notes or
> your setup.
>
> Tim
>
> [1] https://www.mtsamples.com/site/pages/browse.asp?type=3-Allergy%20/%20Immunology
>
> -----Original Message-----
> From: "Baas,Leah" <[email protected]>
> To: [email protected] <[email protected]>, [email protected] <[email protected]>
> Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
> Date: Tue, 29 Jan 2019 19:33:34 +0000
>
> Hi again Tim,
>
> I am trying to check which version of the dictionary I am using when
> running the Default Clinical Pipeline. I have been running the pipeline
> according to the instructions detailed here
> <https://cwiki.apache.org/confluence/display/CTAKES/Default+Clinical+Pipeline>.
> However, I haven’t been able to find documentation specifying which
> dictionary version is built into this pipeline. There must be a simple way
> to check; I am just ignorant. Could you enlighten me?
>
> Thanks,
>
> Leah
>
> From: "Baas,Leah" <[email protected]>
> Date: Tuesday, January 29, 2019 at 12:23 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
>
> Tim,
>
> Thanks for your quick response!
> Probably unsurprisingly, I’ll have to do some googling to learn how to
> check those things. If you could point me in the right direction, that’d
> be great!
>
> Thanks again,
>
> Leah
>
> From: "Miller, Timothy" <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Tuesday, January 29, 2019 at 12:14 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Processing large batches of files in cTAKES [EXTERNAL]
>
> I am able to process that number of files in a reasonable amount of time
> (maybe an hour) on an average desktop. Luckily, debugging your setup should
> be much easier than doing a scaleout. A few possibilities:
>
> * You are running the old (slow) dictionary instead of the new fast one
> * Your document has extremely long sentences
> * Your VM is _extremely_ resource constrained and is thrashing constantly
>
> Do you know how to check these things?
>
> Tim
>
> -----Original Message-----
> From: "Baas,Leah" <[email protected]>
> Reply-to: <[email protected]>
> To: [email protected] <[email protected]>
> Subject: Processing large batches of files in cTAKES [EXTERNAL]
> Date: Tue, 29 Jan 2019 17:58:48 +0000
>
> Hi all,
>
> I would like to process a batch of 13,414 files (avg file size 6.2 KB)
> using the default clinical pipeline. I am new to cTAKES and computer
> programming, and I’m looking for guidance on how to process these files
> with maximum time/CPU efficiency. I am currently running my program on an
> Ubuntu VM with 3 CPUs. It takes me 28 seconds (real time) to process one
> 6.0 KB file. I’m reading up on parallel processing strategies, but would be
> grateful for any suggestions, tips, etc. that you might have!
>
> Thanks,
>
> Leah

-- 
Greg M. Silverman
Senior Systems Developer
NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
Cardiovascular Informatics <http://www.med.umn.edu/cardiology/>
University of Minnesota
[email protected]

› evaluate-it.org ‹
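
For anyone finding this thread later: the fix discussed above (have the Python pre-processing step write plain text files holding only the note text, with no XML markup) could look roughly like the sketch below. It is only an illustration. The layout of Leah's XML files is not shown in the thread, so the code simply pulls the text out of every element, and the function names and directory names are made up for the example.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def xml_to_plaintext(xml_path: Path, out_dir: Path) -> Path:
    """Write the text content of one XML file to a .txt file; return its path."""
    root = ET.parse(xml_path).getroot()
    # itertext() yields the text inside every element in document order,
    # with no tags or attributes. Assumption: every element's text belongs
    # in the output; select a specific element instead if only part of the
    # file is the clinical note.
    pieces = [t.strip() for t in root.itertext() if t.strip()]
    out_path = out_dir / (xml_path.stem + ".txt")
    out_path.write_text("\n".join(pieces) + "\n", encoding="utf-8")
    return out_path


def convert_directory(in_dir: Path, out_dir: Path) -> int:
    """Convert every *.xml file in in_dir; return the number of files written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for xml_path in sorted(in_dir.glob("*.xml")):
        xml_to_plaintext(xml_path, out_dir)
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical directory names; point these at the real input and output.
    written = convert_directory(Path("xml_notes"), Path("txt_notes"))
    print(f"wrote {written} plain text files")
```

The resulting directory of .txt files would then be the input directory for the pipeline run, in place of the XML directory.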
