Actually, my implementation does not share a single pipeline across threads; it creates a set of separate pipelines. I found that once the code is in memory, it does not take long to instantiate many pipelines. Each one is attached to a thread-safe pool object that also hosts a resettable JCas. When a request arrives on a thread, one of these pipeline-JCas pairs is activated and assigned to a document. Typically each pool object needs about 1.7 GB. On a multi-core machine we can run as many parallel threads as we have memory for, driving processor idle time down to 10% or less. Since this approach doesn't rely on the annotators being thread-safe, I can use any of them. Where they have class variables, these are usually for configuration only, and by instantiating all of the pipelines ahead of time on a single thread, they are safely initialized. The multithreading only happens at document processing time. We've run high-intensity sessions with many threads for 12-15 hours and never seen any conflicts.
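For illustration, the pool pattern described above might be sketched roughly as follows. This is a minimal sketch, not cTAKES code: `PipelineJcasPair` is a hypothetical stand-in for a real UIMA AnalysisEngine plus its resettable JCas, and all class/method names are invented for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical stand-in for a UIMA AnalysisEngine plus its resettable JCas.
class PipelineJcasPair {
    void reset() { /* a real pair would call jcas.reset() here */ }
    String process(String docText) {
        // A real pair would populate the JCas and run the engine.
        return "processed:" + docText;
    }
}

// Thread-safe pool: all pairs are instantiated up front on a single thread,
// so any annotator class variables are initialized before concurrency begins.
class PipelinePool {
    private final BlockingQueue<PipelineJcasPair> pool;

    PipelinePool(int size) {
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(new PipelineJcasPair()); // sequential, single-threaded init
        }
    }

    // Blocks until a pipeline is free; each document gets exclusive use of one pair.
    String process(String docText) {
        PipelineJcasPair pair;
        try {
            pair = pool.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while waiting for a pipeline", e);
        }
        try {
            return pair.process(docText);
        } finally {
            pair.reset();      // make the JCas reusable for the next document
            pool.offer(pair);  // return the pair; capacity was reserved by take()
        }
    }
}
```

Request threads would then just call `pool.process(text)`; the pool size is bounded by available memory (roughly 1.7 GB per pair, going by the numbers above), and `take()` makes callers wait rather than oversubscribe.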
On Thu, Mar 28, 2019 at 9:20 PM Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:

> Hi Jeff,
>
> > 1) do you think it might not crash yet produce unreliable results when
> > using the components in the DefaultClinicalPipeline?
>
> -- I am pretty certain that you would get unreliable results. I seem to
> recall attempts with the default pipeline crashing, but with a small
> corpus one could get lucky.
>
> > 2) Do you have any more information about [Spark]
>
> -- No, not really. I don't work with it; I am just regurgitating from
> memory things read or heard.
>
> > 3) In the TS pipelines, what does the "threads" keyword ...
>
> -- "threads" specifies how many threads share a single pipeline.
> -- All annotators in this pipeline must be thread-safe.
> -- It is up to that single instance of a pipeline to be thread-safe;
> "threads" does not enforce anything.
> -- "threads n" will attempt to process a maximum of n documents
> simultaneously on a pipeline.
> -- "threads n" works by running the single pipeline on n threads and
> running a single document through the pipeline on each thread.
> -- It is entirely up to the pipeline to determine the concurrency of
> processing documents.
> -- The more thread-safe annotators that don't require locking, the more
> utilized the threads will be.
>
> I hope that makes sense.
>
> ________________________________________
> From: Jeffrey Miller <jeff...@gmail.com>
> Sent: Thursday, March 28, 2019 3:51 PM
> To: dev@ctakes.apache.org
> Subject: Threading and cTAKES (on Spark) [EXTERNAL]
>
> Hi,
>
> I am following up on a discussion earlier this month in the "re: ctakes
> web service" thread. Apologies if I summarize anyone's comments
> incorrectly. Sean had commented that it would not be advisable to create
> a pool of pipelines and dispatch one per thread in the same process,
> because the individual AEs have static variables and resources that would
> be shared across instances.
> I can comment that anecdotally, we have not seen crashes when doing this
> (but we have seen crashes when trying to share one pipeline across more
> than one thread). Nevertheless, I cannot guarantee that the annotations
> are happening correctly all the time, or that we might not occasionally
> get unlucky and enter into a race condition. It also sounds like, from
> Peter's comment in the previous thread,
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.apache.org_thread.html_93da8248b03b1c59135fb9b4030b0546a4631ec32d6f5c779d2821cc-40-253Cdev.ctakes.apache.org-253E&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=uYabaJeyLV-qVc3xJyB-6w9LVawSFytQEU37NnkdHV0&s=bwkSz7ZhmUnXJZmcm7zVEKuaMpsv_IH-Xs-UYZU3u3M&e=
>
> that a pipeline pool across multiple threads has been stable for his
> work. I have a couple of questions:
>
> 1) Does anyone else have experience with this? Sean, from your comments
> before, do you think it might not crash yet produce unreliable results
> when using the components in the DefaultClinicalPipeline?
>
> 2) Sean, you commented before:
>
> > That being said, supposedly you can configure Spark to handle this by
> > keeping everything contained in a unique copy per thread. Sort of like
> > ThreadLocal (I think), but more effective on a full-pipeline level.
>
> Do you have any more information about this? We are currently looking
> into it, and it looks like it should be possible to limit each executor
> (JVM) to a single thread, but I was wondering if you had any references
> to the ThreadLocal-style setup or knew anyone else who had tried it.
>
> 3) In the TS pipelines, what does the "threads" keyword in the piper file
> actually enforce? Is it the number of threads it will allow you to share
> the pipeline with, or does it automatically create a threaded pipeline
> for you?
>
> Thanks!
> Jeff
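For reference, the ThreadLocal-style setup mentioned in the quoted exchange might look something like the following sketch. This is not cTAKES or Spark code; `Pipeline` is a hypothetical stand-in for a full pipeline instance, used only to show the pattern of giving each worker thread its own private copy.

```java
// Hypothetical stand-in for a full pipeline instance.
class Pipeline {
    String process(String doc) {
        return "annotated:" + doc;
    }
}

class ThreadLocalPipelines {
    // Each worker thread lazily builds its own private pipeline copy on
    // first use, so no annotator state is ever shared across threads.
    private static final ThreadLocal<Pipeline> PIPELINE =
            ThreadLocal.withInitial(Pipeline::new);

    static String process(String doc) {
        return PIPELINE.get().process(doc);
    }
}
```

The trade-off versus a bounded pool is memory: a ThreadLocal holds one full pipeline per live thread with no upper bound, whereas a pool caps the number of instances and makes extra callers wait.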