Hi Jeff,

> 1) do you think it might not crash yet produce unreliable results when
> using the components in the DefaultClinicalPipeline?

-- I am pretty certain that you would get unreliable results.  I seem to recall 
attempts with the default pipeline crashing, but with a small corpus one could 
get lucky.

> 2) Do you have any more information about [Spark]

-- No, not really.  I don't work with it; I am just regurgitating from memory 
things I have read or heard.

> 3) In the TS pipelines, what does the "threads" keyword ...

-- "threads" specifies how many threads share a single pipeline.   
-- All annotators in this pipeline must be thread-safe.
-- It is up to that single instance of a pipeline to be thread-safe.  "threads" 
does not enforce anything.
-- "threads n" will attempt to process a maximum of n documents simultaneously 
on a pipeline.
-- "threads n" works by running the single pipeline on n threads and running a 
single document through the pipeline on each thread.
-- It is entirely up to the pipeline to determine the concurrency of processing 
documents.
-- The more annotators that are thread-safe without requiring locking, the 
better utilized the threads will be.
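
As a rough illustration of the points above, here is a minimal Java sketch of
n threads sharing one pipeline instance, with one document per task. It does
not use the cTAKES or UIMA API: SharedPipelineSketch, Annotator, Pipeline,
UpperCaseAnnotator, CountingAnnotator and processAll are all hypothetical
names made up for this example. It only shows the general pattern, and why
every annotator in the shared pipeline has to be thread-safe on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: none of these types are cTAKES or UIMA classes.
public class SharedPipelineSketch {

    // Stand-in for an annotator: takes a document, returns a processed form.
    interface Annotator {
        String process(String doc);
    }

    // Thread-safe without locking: it has no mutable shared state.
    static final class UpperCaseAnnotator implements Annotator {
        public String process(String doc) {
            return doc.toUpperCase(Locale.ROOT);
        }
    }

    // Has shared state, so it must protect it itself. An AtomicInteger is
    // used here; a plain int field would be a race condition.
    static final class CountingAnnotator implements Annotator {
        private final AtomicInteger seen = new AtomicInteger();

        public String process(String doc) {
            seen.incrementAndGet();
            return doc;
        }

        int seen() {
            return seen.get();
        }
    }

    // A single pipeline instance that every worker thread calls into.
    static final class Pipeline {
        private final List<Annotator> annotators;

        Pipeline(List<Annotator> annotators) {
            this.annotators = annotators;
        }

        String run(String doc) {
            String out = doc;
            for (Annotator a : annotators) {
                out = a.process(out);
            }
            return out;
        }
    }

    // Runs each document through the one shared pipeline, with at most
    // `threads` documents in flight at a time. Nothing here makes the
    // annotators safe; that is entirely up to the pipeline.
    static List<String> processAll(Pipeline pipeline, int threads,
                                   List<String> docs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String doc : docs) {
                futures.add(pool.submit(() -> pipeline.run(doc)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // collected in submission order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        CountingAnnotator counter = new CountingAnnotator();
        Pipeline pipeline =
                new Pipeline(List.of(new UpperCaseAnnotator(), counter));
        List<String> docs = List.of("note one", "note two", "note three");
        System.out.println(processAll(pipeline, 2, docs));
        System.out.println("documents seen: " + counter.seen());
    }
}
```

If an annotator in such a pipeline guarded its work with a lock instead, the
threads would queue up at that annotator, which is the utilization point made
in the last item above.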

I hope that makes sense.



________________________________________
From: Jeffrey Miller <jeff...@gmail.com>
Sent: Thursday, March 28, 2019 3:51 PM
To: dev@ctakes.apache.org
Subject: Threading and cTAKES (on Spark) [EXTERNAL]

Hi,

I am following up on a discussion previously in the "re: ctakes web
service" thread from this month. Apologies if I summarize anyone's comments
incorrectly. Sean had commented that it would not be advisable to create a
pool of pipelines and dispatch 1 per thread in the same process because the
individual AEs have static variables and resources that would be shared
across instances. I can comment that anecdotally, we have not seen crashes
when doing this (but we have seen crashes when we are trying to share 1
pipeline across > 1 thread). Nevertheless, I cannot guarantee that the
annotations are happening correctly all the time or that we might not
occasionally get unlucky and enter into a race condition. It also sounds
like from Peter's comment in the previous thread,
https://lists.apache.org/thread.html/93da8248b03b1c59135fb9b4030b0546a4631ec32d6f5c779d2821cc@%3Cdev.ctakes.apache.org%3E
that a pipeline pool across multiple threads has been stable for his work.
I have a couple of questions:

1) Does anyone else have experience with this? Sean, from your comments
before, do you think it might not crash yet produce unreliable results when
using the components in the DefaultClinicalPipeline?

2) Sean, you commented before

> That being said, supposedly you can configure Spark to handle this by
> keeping everything contained in a unique copy per thread.  Sort of like
> ThreadLocal (I think), but more effective on a full-pipeline level.

Do you have any more information about this? We are currently looking into
it, and it looks like it should be possible to limit each executor (JVM) to
a single thread, but I was wondering if you had any references for the
ThreadLocal-style setup or knew anyone else who had tried it.

3) In the TS pipelines, what does the "threads" keyword in the piper file
actually enforce? Is it the number of threads it will allow you to share
the pipeline with or does it automatically create a threaded pipeline for
you?

Thanks!
Jeff
