Actually, my implementation does not share a single pipeline across threads; it creates a set of separate pipelines. I found that once the code is in memory, it does not take long to instantiate many pipelines. Each one is attached to a thread-safe pool object that also hosts a resettable JCas. When a request arrives on a thread, one of these pipeline-JCas pairs is activated and assigned to a document. Typically each pool object needs about 1.7 GB. On a multi-core machine we can run as many parallel threads as we have memory for, driving processor idle time down to 10% or less. Since this approach doesn't rely on the annotators being thread-safe, I can use any of them. Where they have class variables, these are usually for configuration only, and by instantiating all of the pipelines ahead of time on a single thread, they are safely initialized. The multithreading only happens at document processing time. We've run high-intensity sessions with many threads for 12-15 hours and never seen any conflicts.
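For illustration, the pool pattern described above might be sketched roughly as follows. This is a minimal sketch, not cTAKES code: `PipelineJcasPair` is a hypothetical stand-in for a real UIMA AnalysisEngine plus its resettable JCas, and all class/method names are invented for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical stand-in for a UIMA AnalysisEngine plus its resettable JCas.
class PipelineJcasPair {
    void reset() { /* a real pair would call jcas.reset() here */ }
    String process(String docText) {
        // A real pair would populate the JCas and run the engine.
        return "processed:" + docText;
    }
}

// Thread-safe pool: all pairs are instantiated up front on a single thread,
// so any annotator class variables are initialized before concurrency begins.
class PipelinePool {
    private final BlockingQueue<PipelineJcasPair> pool;

    PipelinePool(int size) {
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(new PipelineJcasPair()); // sequential, single-threaded init
        }
    }

    // Blocks until a pipeline is free; each document gets exclusive use of one pair.
    String process(String docText) {
        PipelineJcasPair pair;
        try {
            pair = pool.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while waiting for a pipeline", e);
        }
        try {
            return pair.process(docText);
        } finally {
            pair.reset();      // make the JCas reusable for the next document
            pool.offer(pair);  // return the pair; capacity was reserved by take()
        }
    }
}
```

Request threads would then just call `pool.process(text)`; the pool size is bounded by available memory (roughly 1.7 GB per pair, going by the numbers above), and `take()` makes callers wait rather than oversubscribe.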
On Thu, Mar 28, 2019 at 9:20 PM Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:

> Hi Jeff,
>
> > 1) do you think it might not crash yet produce unreliable results when
> > using the components in the DefaultClinicalPipeline?
>
> -- I am pretty certain that you would get unreliable results. I seem to
> recall attempts with the default pipeline crashing, but with a small
> corpus one could get lucky.
>
> > 2) Do you have any more information about [Spark]
>
> -- No, not really. I don't work with it; I am just regurgitating from
> memory things read or heard.
>
> > 3) In the TS pipelines, what does the "threads" keyword ...
>
> -- "threads" specifies how many threads share a single pipeline.
> -- All annotators in this pipeline must be thread-safe.
> -- It is up to that single instance of a pipeline to be thread-safe;
> "threads" does not enforce anything.
> -- "threads n" will attempt to process a maximum of n documents
> simultaneously on a pipeline.
> -- "threads n" works by running the single pipeline on n threads and
> running a single document through the pipeline on each thread.
> -- It is entirely up to the pipeline to determine the concurrency of
> processing documents.
> -- The more thread-safe annotators that don't require locking, the more
> utilized the threads will be.
>
> I hope that makes sense.
>
> ________________________________________
> From: Jeffrey Miller <jeff...@gmail.com>
> Sent: Thursday, March 28, 2019 3:51 PM
> To: dev@ctakes.apache.org
> Subject: Threading and cTAKES (on Spark) [EXTERNAL]
>
> Hi,
>
> I am following up on a discussion earlier this month in the "re: ctakes
> web service" thread. Apologies if I summarize anyone's comments
> incorrectly. Sean had commented that it would not be advisable to create
> a pool of pipelines and dispatch one per thread in the same process,
> because the individual AEs have static variables and resources that would
> be shared across instances.
> I can comment that anecdotally, we have not seen crashes when doing this
> (but we have seen crashes when trying to share one pipeline across more
> than one thread). Nevertheless, I cannot guarantee that the annotations
> are happening correctly all the time, or that we might not occasionally
> get unlucky and enter into a race condition. It also sounds like, from
> Peter's comment in the previous thread,
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.apache.org_thread.html_93da8248b03b1c59135fb9b4030b0546a4631ec32d6f5c779d2821cc-40-253Cdev.ctakes.apache.org-253E&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=uYabaJeyLV-qVc3xJyB-6w9LVawSFytQEU37NnkdHV0&s=bwkSz7ZhmUnXJZmcm7zVEKuaMpsv_IH-Xs-UYZU3u3M&e=
>
> that a pipeline pool across multiple threads has been stable for his
> work. I have a couple of questions:
>
> 1) Does anyone else have experience with this? Sean, from your comments
> before, do you think it might not crash yet produce unreliable results
> when using the components in the DefaultClinicalPipeline?
>
> 2) Sean, you commented before:
>
> > That being said, supposedly you can configure Spark to handle this by
> > keeping everything contained in a unique copy per thread. Sort of like
> > ThreadLocal (I think), but more effective on a full-pipeline level.
>
> Do you have any more information about this? We are currently looking
> into it, and it looks like it should be possible to limit each executor
> (JVM) to a single thread, but I was wondering if you had any references
> to the ThreadLocal-style setup or knew anyone else who had tried it.
>
> 3) In the TS pipelines, what does the "threads" keyword in the piper file
> actually enforce? Is it the number of threads it will allow you to share
> the pipeline with, or does it automatically create a threaded pipeline
> for you?
>
> Thanks!
> Jeff
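For reference, the ThreadLocal-style setup mentioned in the quoted exchange might look something like the following sketch. This is not cTAKES or Spark code; `Pipeline` is a hypothetical stand-in for a full pipeline instance, used only to show the pattern of giving each worker thread its own private copy.

```java
// Hypothetical stand-in for a full pipeline instance.
class Pipeline {
    String process(String doc) {
        return "annotated:" + doc;
    }
}

class ThreadLocalPipelines {
    // Each worker thread lazily builds its own private pipeline copy on
    // first use, so no annotator state is ever shared across threads.
    private static final ThreadLocal<Pipeline> PIPELINE =
            ThreadLocal.withInitial(Pipeline::new);

    static String process(String doc) {
        return PIPELINE.get().process(doc);
    }
}
```

The trade-off versus a bounded pool is memory: a ThreadLocal holds one full pipeline per live thread with no upper bound, whereas a pool caps the number of instances and makes extra callers wait.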