Yes, one batch! We had no problems with 10 K plain text files.

On Tue, Jan 29, 2019 at 3:33 PM Baas,Leah <[email protected]>
wrote:

> Ah, I see. Yes—I will change the pre-processing step to write plaintext
> instead of xml files. Thank you so much for the tip!
>
>
>
> Once I’ve fixed the pre-processing code, do you anticipate that I should
> be able to process all of the input files in one batch?
>
>
>
> Leah
>
>
>
> *From: *"Miller, Timothy" <[email protected]>
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Tuesday, January 29, 2019 at 3:23 PM
> *To: *"Baas,Leah" <[email protected]>, "[email protected]"
> <[email protected]>
> *Subject: *Re: Processing large batches of files in cTAKES [EXTERNAL]
>
>
>
> OK, if you can see XML tags in the right pane, that means that cTAKES is
> trying to process the XML markup as well as the text. Can you change your
> Python pre-processing step to write plaintext files containing only the
> text of the note, with no XML, and then process those? I think there are
> probably cases where having XML in the text would confuse some of the
> modules and cause them to run slowly. You will also get weird outputs: I've
> seen "<span>" get annotated as a "body measurement finding" when we
> accidentally processed some HTML once.
>
> Tim
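
For anyone following along, here is a minimal sketch of the kind of
pre-processing change Tim describes: read each XML file and write out only
its text. It assumes the note text is ordinary element content, since the
thread does not show the actual XML layout. The function names
(xml_to_plaintext, convert_directory) are made up for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def xml_to_plaintext(xml_path: Path) -> str:
    """Return only the text content of an XML file, with all tags dropped."""
    root = ET.parse(xml_path).getroot()
    # itertext() yields the text found inside every element, in document order.
    return "".join(root.itertext()).strip()


def convert_directory(xml_dir: Path, txt_dir: Path) -> int:
    """Write one .txt file per .xml file; return the number of files converted."""
    txt_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for xml_path in sorted(xml_dir.glob("*.xml")):
        text = xml_to_plaintext(xml_path)
        (txt_dir / (xml_path.stem + ".txt")).write_text(text, encoding="utf-8")
        count += 1
    return count
```

Calling convert_directory(Path("notes_xml"), Path("notes_txt")), with
whatever directory names you actually use, would give a directory of .txt
files to use as pipeline input. If the XML also holds metadata elements
(subject IDs, dates), you would want to select only the note element rather
than joining all text.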
>
>
>
>
>
> -----Original Message-----
>
> *From*: "Baas,Leah" <[email protected]>
>
> *To*: "Miller, Timothy" <[email protected]>,
> [email protected] <[email protected]>
>
> *Subject*: Re: Processing large batches of files in cTAKES [EXTERNAL]
>
> *Date*: Tue, 29 Jan 2019 21:15:54 +0000
>
>
>
> Yes, I’ve been following those instructions to view the .xmi files in the
> CVD.  The right pane shows the text of the XML file.
>
>
>
> Leah
>
>
>
> *From: *"Miller, Timothy" <[email protected]>
> *Date: *Tuesday, January 29, 2019 at 3:00 PM
> *To: *"Baas,Leah" <[email protected]>, "[email protected]"
> <[email protected]>
> *Subject: *Re: Processing large batches of files in cTAKES [EXTERNAL]
>
>
>
> So after you process all the notes do you follow the instructions on the
> wiki page that say:
>
> You can view information in the XMI files using the UIMA Cas Visual
> Debugger (CVD).
>
>
>
> Execute bin/runctakesCVD
>
> Select File > Read Type System File
>
> Select TypeSystem.xml in resources/org/apache/ctakes/typesystem/types/
>
> Select File > Read XMI CAS File
>
> Select any .xmi file in your outputDirectory
>
>
>
> and look at that .xmi file? If so, what do you see in the right pane? The
> text of the note or the text of an xml file?
>
> Tim
>
>
>
>
>
> -----Original Message-----
>
> *From*: "Baas,Leah" <[email protected]>
>
> *To*: "Miller, Timothy" <[email protected]>,
> [email protected] <[email protected]>
>
> *Subject*: Re: Processing large batches of files in cTAKES [EXTERNAL]
>
> *Date*: Tue, 29 Jan 2019 20:45:58 +0000
>
>
>
> It is not CDA format. I used Python’s ElementTree module to generate XML
> files containing the clinical notes for each subject in my dataset. When I
> run the Default Clinical Pipeline, I can successfully generate XMI output
> files for each XML file in my input directory. The following WARNING
> message appears multiple times over the course of the processing (not sure
> if this is at all related to the issue at hand):
>
>
>
> Jan 29, 2019 2:02:56 PM org.apache.uima.util.MessageReport
> decreasingWithTrace(51)
>
> WARNING: Message count: 1; Feature
> org.apache.ctakes.typesystem.type.textsem.Predicate:relations is marked
> multipleReferencesAllowed=false, but it has multiple references.  These
> will be serialized in duplicate. Message count indicates messages skipped
> to avoid potential flooding. Set FINE logging level for stacktrace.
>
>
>
> Leah
>
>
>
> *From: *"Miller, Timothy" <[email protected]>
> *Date: *Tuesday, January 29, 2019 at 2:28 PM
> *To: *"Baas,Leah" <[email protected]>, "[email protected]"
> <[email protected]>
> *Subject: *Re: Processing large batches of files in cTAKES [EXTERNAL]
>
>
>
> Well, if you're processing XML files, that will likely cause a problem with
> this script: it expects plain text files in a directory. Maybe Sean can
> chime in on whether it's possible to use an XML collection reader with the
> runClinicalPipeline.sh script? Is it CDA format?
>
> Tim
>
>
>
> -----Original Message-----
>
> *From*: "Baas,Leah" <[email protected]>
>
> *To*: "Miller, Timothy" <[email protected]>,
> [email protected] <[email protected]>
>
> *Subject*: Re: Processing large batches of files in cTAKES [EXTERNAL]
>
> *Date*: Tue, 29 Jan 2019 20:21:17 +0000
>
>
>
> Hi Tim,
>
>
>
> Thanks again for working through this with me. I hadn’t read through the
> time stamps carefully enough to notice the one-time cost of startup.
>
>
>
> I did replicate your setup by copying/pasting 7 of my XML input files into
> an empty directory. Here’s what I saw:
>
>
>
>    1. For the startup-- 20 seconds between the first time-stamped log
>    message:
>
> *29 Jan 2019 14:02:35  INFO SentenceDetector - Sentence detector model
> file: org/apache/ctakes/core/sentdetect/sd-med-model.zip*
>
>
>
>                 and the first log message doing processing:
>
> *29 Jan 2019 14:02:55  INFO SentenceDetector - Starting processing.*
>
>
>
>    2. Once started up, 12 seconds to process the notes.
>
> *29 Jan 2019 14:03:07  INFO ClearNLPSemanticRoleLabelerAE - Finished
> processing*
>
>
>
> Does this help narrow things down?
>
>
>
> Leah
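
A rough back-of-the-envelope sketch of why the one-time startup matters,
using only numbers from this thread: 13,414 files, 28 seconds for a
single-file run, and Leah's 7-file test (20 s startup, 12 s processing).
It assumes per-note cost scales linearly and that the 28 s figure includes
startup. Neither is confirmed in the thread, and the 7 test files still
contained XML markup, so treat the result as an order-of-magnitude
estimate only.

```python
def batch_seconds(n_notes: int, startup_s: float, per_note_s: float) -> float:
    """Startup is paid once per run; the per-note cost is paid for every note."""
    return startup_s + n_notes * per_note_s


N_NOTES = 13_414       # batch size from the original question
PER_RUN_S = 28.0       # reported wall time for one single-file run
STARTUP_S = 20.0       # startup time in Leah's 7-file test
PER_NOTE_S = 12.0 / 7  # 12 s of processing for 7 notes in the same test

# One pipeline launch per file versus a single launch for the whole directory.
one_run_per_file_h = N_NOTES * PER_RUN_S / 3600
one_batch_h = batch_seconds(N_NOTES, STARTUP_S, PER_NOTE_S) / 3600

print(f"{one_run_per_file_h:.1f} h vs {one_batch_h:.1f} h")
# prints "104.3 h vs 6.4 h"
```

Under these assumptions, launching the pipeline once for the whole
directory is far cheaper than launching it once per file.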
>
>
>
> *From: *"Miller, Timothy" <[email protected]>
> *Date: *Tuesday, January 29, 2019 at 1:58 PM
> *To: *"Baas,Leah" <[email protected]>, "[email protected]"
> <[email protected]>
> *Subject: *Re: Processing large batches of files in cTAKES [EXTERNAL]
>
>
>
> I haven't used that script myself, but I just tried it now on some notes
> from mtsamples. Maybe you can try to replicate that setup? I just
> copy/pasted the 7 allergy/immunology notes [1] into 7 text files in an
> empty directory. Here's what I see:
>
>
>
> 1) It is pretty slow to start up -- but this is a one-time cost (~50
> seconds). I'm looking at the time between the very first time-stamped log
> message:
>
> *29 Jan 2019 14:51:51  INFO SentenceDetector - Sentence detector model
> file: org/apache/ctakes/core/sentdetect/sd-med-model.zip*
>
>
>
> and the first log message doing processing:
>
>
>
> *29 Jan 2019 14:52:40  INFO SentenceDetector - Starting processing*
>
>
>
> 2) Once started up, it processes the notes in about 14s. This is actually
> slower than expected, but still a lot faster than what you were seeing. I'm
> looking at the time between the start of processing just above and the last
> log message before it quits:
>
>
>
> *29 Jan 2019 14:52:54  INFO ClearNLPSemanticRoleLabelerAE - Finished
> processing*
>
>
>
> If you can replicate this input/output setup and approximate timing in
> your VM first, then we can see whether it's a function of your notes or
> your setup.
>
>
>
> Tim
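
If you would rather not do the timestamp subtraction by eye, the comparison
Tim describes can be sketched as below. The three log lines are copied from
his message. The helper name log_time is made up, and the parsing assumes
every line starts with a "DD Mon YYYY HH:MM:SS" timestamp, as these do.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%d %b %Y %H:%M:%S"


def log_time(line: str) -> datetime:
    """Parse the leading 'DD Mon YYYY HH:MM:SS' timestamp of a log line."""
    # The timestamp is the first four whitespace-separated fields.
    stamp = " ".join(line.split()[:4])
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


first_message = (
    "29 Jan 2019 14:51:51  INFO SentenceDetector - Sentence detector model "
    "file: org/apache/ctakes/core/sentdetect/sd-med-model.zip"
)
start = "29 Jan 2019 14:52:40  INFO SentenceDetector - Starting processing"
finish = "29 Jan 2019 14:52:54  INFO ClearNLPSemanticRoleLabelerAE - Finished processing"

startup_seconds = (log_time(start) - log_time(first_message)).total_seconds()
processing_seconds = (log_time(finish) - log_time(start)).total_seconds()

print(startup_seconds, processing_seconds)
# prints "49.0 14.0"
```

Running the same three lookups on your own log gives the startup versus
processing split for your setup.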
>
>
>
>
>
> [1]
> https://www.mtsamples.com/site/pages/browse.asp?type=3-Allergy%20/%20Immunology
>
>
>
> -----Original Message-----
>
> *From*: "Baas,Leah" <[email protected]>
>
> *To*: [email protected] <[email protected]>,
> [email protected] <[email protected]>
>
> *Subject*: Re: Processing large batches of files in cTAKES [EXTERNAL]
>
> *Date*: Tue, 29 Jan 2019 19:33:34 +0000
>
>
>
> Hi again Tim,
>
>
>
> I am trying to check which version of the dictionary I am using when
> running the Default Clinical Pipeline. I have been running the pipeline
> according to the instructions detailed here
> <https://cwiki.apache.org/confluence/display/CTAKES/Default+Clinical+Pipeline>.
> However, I haven’t been able to find documentation specifying which
> dictionary version is built into this pipeline. There must be a simple way
> to check—I am just ignorant. Could you enlighten me?
>
>
>
> Thanks,
>
>
>
> Leah
>
>
>
> *From: *"Baas,Leah" <[email protected]>
> *Date: *Tuesday, January 29, 2019 at 12:23 PM
> *To: *"[email protected]" <[email protected]>
> *Subject: *Re: Processing large batches of files in cTAKES [EXTERNAL]
>
>
>
> Tim,
>
>
>
> Thanks for your quick response! Probably unsurprisingly, I’ll have to do
> some googling to learn how to check those things. If you could point me in
> the right direction, that’d be great!
>
>
>
> Thanks again,
>
>
>
> Leah
>
>
>
> *From: *"Miller, Timothy" <[email protected]>
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Tuesday, January 29, 2019 at 12:14 PM
> *To: *"[email protected]" <[email protected]>
> *Subject: *Re: Processing large batches of files in cTAKES [EXTERNAL]
>
>
>
> I am able to process that number of files in a reasonable amount of time
> (maybe an hour) on an average desktop. Luckily, debugging your setup should
> be much easier than doing a scaleout. A few possibilities:
>
>
>
> * You are running the old (slow) dictionary instead of the new fast one
>
> * Your document has extremely long sentences
>
> * Your VM is _extremely_ resource constrained and is thrashing constantly
>
>
>
> Do you know how to check these things?
>
> Tim
>
>
>
>
>
>
>
> -----Original Message-----
>
> *From*: "Baas,Leah" <[email protected]>
>
> Reply-to: <[email protected]>
>
> *To*: [email protected] <[email protected]>
>
> *Subject*: Processing large batches of files in cTAKES [EXTERNAL]
>
> *Date*: Tue, 29 Jan 2019 17:58:48 +0000
>
>
>
> Hi all,
>
>
>
> I would like to process a batch of 13,414 files (avg file size 6.2 KB)
> using the default clinical pipeline. I am new to cTAKES and computer
> programming, and I’m looking for guidance on how to process these files
> with maximum time/CPU efficiency. I am currently running my program on an
> Ubuntu VM with 3 CPUs. It takes me 28 seconds (real time) to process one
> 6.0 KB file. I’m reading up on parallel processing strategies, but would be
> grateful for any suggestions, tips, etc. that you might have!
>
>
>
> Thanks,
>
>
>
> Leah
>
>
>
>
>
>


-- 
Greg M. Silverman
Senior Systems Developer
NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
Cardiovascular Informatics <http://www.med.umn.edu/cardiology/>
University of Minnesota
[email protected]

 ›  evaluate-it.org  ‹
