Hi Jeff,

> 1) do you think it might not crash yet produce unreliable results when using the components in the DefaultClinicalPipeline?

-- I am pretty certain that you would get unreliable results. I seem to recall attempts with the default pipeline crashing, but with a small corpus one could get lucky.

> 2) Do you have any more information about [Spark]

-- No, not really. I don't work with it, I am just regurgitating from memory things read or heard.

> 3) In the TS pipelines, what does the "threads" keyword ...

-- "threads" specifies how many threads share a single pipeline.
-- All annotators in this pipeline must be thread-safe.
-- It is up to that single instance of a pipeline to be thread safe. "threads" does not enforce anything.
-- "threads n" will attempt to process a maximum of n documents simultaneously on a pipeline.
-- "threads n" works by running the single pipeline on n threads and running a single document through the pipeline on each thread.
-- It is entirely up to the pipeline to determine the concurrency of processing documents.
-- The more thread-safe annotators that don't require locking, the more utilized the threads will be.

I hope that makes sense.

________________________________________
From: Jeffrey Miller <jeff...@gmail.com>
Sent: Thursday, March 28, 2019 3:51 PM
To: dev@ctakes.apache.org
Subject: Threading and cTAKES (on Spark) [EXTERNAL]

Hi,

I am following up on a discussion previously in the "re: ctakes web service" thread from this month. Apologies if I summarize anyone's comments incorrectly.

Sean had commented that it would not be advisable to create a pool of pipelines and dispatch 1 per thread in the same process because the individual AEs have static variables and resources that would be shared across instances. I can comment that anecdotally, we have not seen crashes when doing this (but we have seen crashes when we are trying to share 1 pipeline across > 1 thread). Nevertheless, I cannot guarantee that the annotations are happening correctly all the time or that we might not occasionally get unlucky and enter into a race condition.
It also sounds like from Peter's comment in the previous thread,
https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.apache.org_thread.html_93da8248b03b1c59135fb9b4030b0546a4631ec32d6f5c779d2821cc-40-253Cdev.ctakes.apache.org-253E&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=uYabaJeyLV-qVc3xJyB-6w9LVawSFytQEU37NnkdHV0&s=bwkSz7ZhmUnXJZmcm7zVEKuaMpsv_IH-Xs-UYZU3u3M&e=
that a pipeline pool across multiple threads has been stable for his work.

I have a couple of questions:

1) Does anyone else have experience with this? Sean, from your comments before, do you think it might not crash yet produce unreliable results when using the components in the DefaultClinicalPipeline?

2) Sean, you commented before

> That being said, supposedly you can configure Spark to handle this by keeping everything contained in a unique copy per thread. Sort of like ThreadLocal (I think), but more effective on a full-pipeline level.

Do you have any more information about this- we are currently looking into it, and it looks like it should be possible to limit each executor (JVM) to a single thread, but I was wondering if you had any references to the ThreadLocal-style setup or knew anyone else that had tried it.

3) In the TS pipelines, what does the "threads" keyword in the piper file actually enforce? Is it the number of threads it will allow you to share the pipeline with or does it automatically create a threaded pipeline for you?

Thanks!
Jeff
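
For reference, the "threads n" behavior described in the reply at the top of this message (one pipeline instance shared by n threads, one document per thread at a time) can be sketched roughly as below. This is only an illustration under stated assumptions, not the cTAKES implementation: `Pipeline`, `CountingPipeline`, and `runAll` are hypothetical names made up for this sketch and are not cTAKES or UIMA types. The stand-in pipeline is thread-safe only because its single piece of shared state is an atomic counter; as noted above, nothing in the threading mechanism itself enforces that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedPipelineSketch {

    /** Hypothetical stand-in for a pipeline; NOT a cTAKES or UIMA type. */
    interface Pipeline {
        String process(String document);
    }

    /** Thread-safe by construction: the only shared state is an atomic counter. */
    static final class CountingPipeline implements Pipeline {
        private final AtomicInteger processed = new AtomicInteger();

        @Override
        public String process(String document) {
            processed.incrementAndGet();
            return document.toUpperCase(Locale.ROOT);
        }

        int processedCount() {
            return processed.get();
        }
    }

    /**
     * Runs every document through ONE shared pipeline instance, with at most
     * `threads` documents in flight at once. Results keep the input order.
     */
    static List<String> runAll(Pipeline pipeline, List<String> documents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String doc : documents) {
                // Each task pushes a single document through the shared pipeline.
                Callable<String> task = () -> pipeline.process(doc);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while processing documents", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A document failed in the pipeline", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

If `CountingPipeline` used a plain `int` field instead of `AtomicInteger`, the same code would still run without crashing but the count could silently come out wrong, which is the "no crash, unreliable results" case raised in question 1.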