Re: Threading and cTAKES (on Spark) [EXTERNAL]

Jeffrey Miller Thu, 28 Mar 2019 13:33:19 -0700

Thanks again Sean, that is all very helpful.

On Thu, Mar 28, 2019 at 4:20 PM Finan, Sean <
[email protected]> wrote:


> Hi Jeff,
>
> > 1) do you think it might not crash yet produce unreliable results when
> using the components in the DefaultClinicalPipeline?
>
> -- I am pretty certain that you would get unreliable results.  I seem to
> recall attempts with the default pipeline crashing, but with a small corpus
> one could get lucky.
>
> > 2) Do you have any more information about [Spark]
>
> -- No, not really.  I don't work with it; I am just repeating from
> memory things I have read or heard.
>
> > 3) In the TS pipelines, what does the "threads" keyword ...
>
> -- "threads" specifies how many threads share a single pipeline.
> -- All annotators in this pipeline must be thread-safe.
> -- It is up to that single instance of a pipeline to be thread-safe.
> "threads" does not enforce anything.
> -- "threads n" will attempt to process a maximum of n documents
> simultaneously on a pipeline.
> -- "threads n" works by running the single pipeline on n threads and
> running a single document through the pipeline on each thread.
> -- It is entirely up to the pipeline to determine the concurrency of
> processing documents.
> -- The more annotators that are thread-safe without requiring locks, the
> better the threads will be utilized.
>
> I hope that makes sense.
>
>
>
> ________________________________________
> From: Jeffrey Miller <[email protected]>
> Sent: Thursday, March 28, 2019 3:51 PM
> To: [email protected]
> Subject: Threading and cTAKES (on Spark) [EXTERNAL]
>
> Hi,
>
> I am following up on an earlier discussion in the "re: ctakes web
> service" thread from this month. Apologies if I summarize anyone's comments
> incorrectly. Sean had commented that it would not be advisable to create a
> pool of pipelines and dispatch 1 per thread in the same process because the
> individual AEs have static variables and resources that would be shared
> across instances. I can comment that anecdotally, we have not seen crashes
> when doing this (but we have seen crashes when we are trying to share 1
> pipeline across > 1 thread). Nevertheless, I cannot guarantee that the
> annotations are always produced correctly, or that we will not
> occasionally get unlucky and hit a race condition. It also sounds, from
> Peter's comment in the previous thread,
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.apache.org_thread.html_93da8248b03b1c59135fb9b4030b0546a4631ec32d6f5c779d2821cc-40-253Cdev.ctakes.apache.org-253E&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=uYabaJeyLV-qVc3xJyB-6w9LVawSFytQEU37NnkdHV0&s=bwkSz7ZhmUnXJZmcm7zVEKuaMpsv_IH-Xs-UYZU3u3M&e=
> that a pipeline pool across multiple threads has been stable for his work.
> I have a couple of questions:
>
> 1) Does anyone else have experience with this? Sean, from your comments
> before, do you think it might not crash yet produce unreliable results when
> using the components in the DefaultClinicalPipeline?
>
> 2) Sean, you commented before
>
> > That being said, supposedly you can configure Spark to handle this by
> keeping everything contained in a unique copy per thread.  Sort of like
> ThreadLocal (I think), but more effective on a full-pipeline level.
>
> Do you have any more information about this? We are currently looking into
> it, and it looks like it should be possible to limit each executor (JVM) to
> a single thread, but I was wondering if you had any references to the
> ThreadLocal-style setup or knew anyone else who had tried it.
>
> 3) In the TS pipelines, what does the "threads" keyword in the piper file
> actually enforce? Is it the number of threads it will allow you to share
> the pipeline with or does it automatically create a threaded pipeline for
> you?
>
> Thanks!
> Jeff
>
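Sean's description of "threads n" above (one pipeline instance, n threads, one document per thread) can be modelled roughly as follows. This is only a sketch of the idea and not how cTAKES implements it: the Pipeline class, its two counters, and the run method are hypothetical stand-ins, not cTAKES or UIMA APIs. The lock-free step runs in parallel on every thread, while the synchronized step lets one thread through at a time, which is why annotators that need locks leave the other threads waiting.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedPipelineSketch {

    /** Hypothetical stand-in for one pipeline instance; not a cTAKES class. */
    static final class Pipeline {
        private final AtomicInteger lockFreeCount = new AtomicInteger();
        private final Object lock = new Object();
        private int lockedCount = 0;

        void process(String document) {
            // "Annotator" that is thread-safe without locking: threads run it in parallel.
            lockFreeCount.incrementAndGet();
            // "Annotator" that needs a lock: threads pass through here one at a time.
            synchronized (lock) {
                lockedCount++;
            }
        }

        int processedCount() {
            synchronized (lock) {
                return lockedCount;
            }
        }
    }

    /** Runs every document through ONE shared pipeline on at most `threads` threads. */
    static int run(int threads, List<String> documents) {
        Pipeline pipeline = new Pipeline(); // single instance shared by all threads
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String document : documents) {
            pool.submit(() -> pipeline.process(document)); // one document per task
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("documents still running after timeout");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return pipeline.processedCount();
    }

    public static void main(String[] args) {
        // Prints the number of documents that went through the shared pipeline.
        System.out.println(run(4, List.of("note-1", "note-2", "note-3")));
    }
}
```

Nothing in this sketch checks that process is safe to call concurrently, which mirrors the point that "threads" does not enforce anything; remove the synchronized block and the plain int counter can silently lose updates.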

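The "unique copy per thread" idea from question 2 can be sketched in plain Java with ThreadLocal, so that each worker thread lazily builds and keeps its own pipeline. This is an illustration under assumptions, not a Spark or cTAKES recipe: the Pipeline class and the run method are hypothetical. It also has the limitation raised earlier in the thread: a ThreadLocal only separates instances, so any static fields inside the annotators are still shared by every pipeline in the same JVM.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PerThreadPipelineSketch {

    /** Hypothetical stand-in for a pipeline that is NOT thread-safe. */
    static final class Pipeline {
        private int documentsSeen = 0; // plain field: safe only while one thread owns this instance

        void process(String document) {
            documentsSeen++;
        }
    }

    /** Returns {number of pipelines created, number of documents processed}. */
    static int[] run(int threads, List<String> documents) {
        List<Pipeline> created = Collections.synchronizedList(new ArrayList<>());
        // Each worker thread builds its own pipeline the first time it calls get().
        ThreadLocal<Pipeline> perThread = ThreadLocal.withInitial(() -> {
            Pipeline pipeline = new Pipeline();
            created.add(pipeline);
            return pipeline;
        });
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String document : documents) {
            pool.submit(() -> perThread.get().process(document));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("documents still running after timeout");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        int total = 0;
        synchronized (created) { // all workers have finished; sum their private counters
            for (Pipeline pipeline : created) {
                total += pipeline.documentsSeen;
            }
        }
        return new int[] {created.size(), total};
    }

    public static void main(String[] args) {
        int[] result = run(3, List.of("a", "b", "c", "d", "e", "f"));
        System.out.println(result[0] + " pipelines, " + result[1] + " documents");
    }
}
```

Because of the shared statics, the single-thread-per-executor (JVM) setup Jeff mentions remains the stricter form of isolation; the ThreadLocal approach only helps when the unsafe state lives in instance fields.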