Thanks again Sean, that is all very helpful.

On Thu, Mar 28, 2019 at 4:20 PM Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:
> Hi Jeff,
>
> > 1) do you think it might not crash yet produce unreliable results when
> > using the components in the DefaultClinicalPipeline?
>
> -- I am pretty certain that you would get unreliable results. I seem to
> recall attempts with the default pipeline crashing, but with a small corpus
> one could get lucky.
>
> > 2) Do you have any more information about [Spark]
>
> -- No, not really. I don't work with it, I am just regurgitating from
> memory things read or heard.
>
> > 3) In the TS pipelines, what does the "threads" keyword ...
>
> -- "threads" specifies how many threads share a single pipeline.
> -- All annotators in this pipeline must be thread-safe.
> -- It is up to that single instance of a pipeline to be thread safe.
>    "threads" does not enforce anything.
> -- "threads n" will attempt to process a maximum of n documents
>    simultaneously on a pipeline.
> -- "threads n" works by running the single pipeline on n threads and
>    running a single document through the pipeline on each thread.
> -- It is entirely up to the pipeline to determine the concurrency of
>    processing documents.
> -- The more thread-safe annotators that don't require locking, the more
>    utilized the threads will be.
>
> I hope that makes sense.
>
> ________________________________________
> From: Jeffrey Miller <jeff...@gmail.com>
> Sent: Thursday, March 28, 2019 3:51 PM
> To: dev@ctakes.apache.org
> Subject: Threading and cTAKES (on Spark) [EXTERNAL]
>
> Hi,
>
> I am following up on a discussion previously in the "re: ctakes web
> service" thread from this month. Apologies if I summarize anyone's comments
> incorrectly. Sean had commented that it would not be advisable to create a
> pool of pipelines and dispatch 1 per thread in the same process because the
> individual AEs have static variables and resources that would be shared
> across instances. I can comment that anecdotally, we have not seen crashes
> when doing this (but we have seen crashes when we are trying to share
> 1 pipeline across > 1 thread). Nevertheless, I cannot guarantee that the
> annotations are happening correctly all the time or that we might not
> occasionally get unlucky and enter into a race condition. It also sounds
> like from Peter's comment in the previous thread,
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.apache.org_thread.html_93da8248b03b1c59135fb9b4030b0546a4631ec32d6f5c779d2821cc-40-253Cdev.ctakes.apache.org-253E&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=uYabaJeyLV-qVc3xJyB-6w9LVawSFytQEU37NnkdHV0&s=bwkSz7ZhmUnXJZmcm7zVEKuaMpsv_IH-Xs-UYZU3u3M&e=
>
> that a pipeline pool across multiple threads has been stable for his work.
> I have a couple of questions:
>
> 1) Does anyone else have experience with this? Sean, from your comments
> before, do you think it might not crash yet produce unreliable results when
> using the components in the DefaultClinicalPipeline?
>
> 2) Sean, you commented before
>
> > That being said, supposedly you can configure Spark to handle this by
> > keeping everything contained in a unique copy per thread. Sort of like
> > ThreadLocal (I think), but more effective on a full-pipeline level.
>
> Do you have any more information about this- we are currently looking into
> it, and it looks like it should be possible to limit each executor (JVM) to
> a single thread, but I was wondering if you had any references to the
> ThreadLocal-style setup or knew anyone else that had tried it.
>
> 3) In the TS pipelines, what does the "threads" keyword in the piper file
> actually enforce? Is it the number of threads it will allow you to share
> the pipeline with or does it automatically create a threaded pipeline for
> you?
>
> Thanks!
> Jeff
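
For anyone reading this later, here is a rough Java sketch of the "unique copy per thread" idea from question 2, using plain java.lang.ThreadLocal. Everything in it is an assumption made for illustration: ToyPipeline, PIPELINE and run are hypothetical names, not cTAKES, UIMA or Spark APIs, and the "pipeline" just upper-cases a string. It is only meant to show why a per-thread instance isolates instance fields but does nothing for static fields, which is the concern raised about AEs with static variables.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PerThreadPipelineSketch {

    // Hypothetical stand-in for a pipeline; NOT a cTAKES class.
    static final class ToyPipeline {
        // Static state: one copy per JVM, shared by every instance on every thread.
        static final AtomicInteger SHARED_STATIC_COUNTER = new AtomicInteger();

        // Instance state: only touched by the thread that owns this instance.
        private int documentsSeen = 0;

        String process(String document) {
            documentsSeen++;
            SHARED_STATIC_COUNTER.incrementAndGet();
            return document.toUpperCase();
        }

        int documentsSeen() {
            return documentsSeen;
        }
    }

    // One ToyPipeline per thread, created lazily the first time a thread asks for it.
    static final ThreadLocal<ToyPipeline> PIPELINE = ThreadLocal.withInitial(ToyPipeline::new);

    // Processes the documents on a fixed pool of worker threads and returns
    // the number of distinct ToyPipeline instances that were actually used.
    static int run(int threads, List<String> documents) {
        Set<ToyPipeline> instances = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : documents) {
            pool.execute(() -> {
                ToyPipeline pipeline = PIPELINE.get(); // this thread's own copy
                instances.add(pipeline);
                pipeline.process(doc);
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return instances.size();
    }

    public static void main(String[] args) {
        int used = run(4, List.of("a", "b", "c", "d", "e", "f", "g", "h"));
        System.out.println("distinct pipeline instances: " + used);
        System.out.println("shared static counter: " + ToyPipeline.SHARED_STATIC_COUNTER.get());
    }
}
```

The instance field is safe because no two threads ever see the same ToyPipeline, but the static counter is still hit by all of them. In this toy it is an AtomicInteger so it stays consistent; a plain static field in a real annotator would not be, which is why one thread per executor JVM looks like the safer route.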