FYI, I just doubled the number of backends and clients and increased the throughput to ~1000 docs/second. Server utilization is only minimal now.
I should note that, unlike on a Spark cluster, this is running on two old servers and a VM. The nice thing about Kubernetes is that you can easily scale the number of instances up or down using horizontal pod autoscaling. Plus, it's a lot easier to manage than a Spark cluster.

We just started running the cTAKES pipeline on this, so it's an experiment in progress. So far, the results are very decent. I'll scale it up even more in a day or so.

Greg--

On Tue, Nov 17, 2020 at 11:10 AM Greg Silverman <[email protected]> wrote:

> We at the UMN NLP/IE Lab have developed NLP-ADAPT-kube to scale out four UIMA
> NLP annotators using Kubernetes/UIMA-AS, including cTAKES, CLAMP, MetaMap
> (using the UIMA wrapper), and our own homegrown BioMedICUS. Our project is
> here: https://github.com/nlpie/nlp-adapt-kube
>
> There are two versions: one for CPM, which includes QuickUMLS, and the other
> for UIMA-AS. The AS versions are under the docker folder and the argo-k8s
> folder, and use the four engines mentioned above. There is a project wiki (but
> it is slightly out of date). We are in the process of working non-UIMA
> engines (like QuickUMLS and our new version of BioMedICUS) into the AS
> workflow (we're using AMQ for message queuing).
>
> We're currently running cTAKES using Kubernetes HPA with 6 backends and
> 2 clients across 3 compute nodes, getting very decent throughput (~150
> docs/second). We could definitely scale it up even further.
>
> For a comparison of how well this scales, we were running 64 MetaMap backends
> with 16 clients and getting ~40 docs/second for very large clinical
> documents (which for MetaMap is very decent). This was across 5 compute
> nodes.
>
> If you're interested, we can assist in implementation. The client does
> require some customizations based on the backend database you're using:
> https://github.com/nlpie/nlp-adapt-kube/tree/master/docker/as/client, but
> that is pretty straightforward.
>
> Best!
>
> Greg--
>
> On Tue, Nov 17, 2020 at 10:47 AM John Doe <[email protected]> wrote:
>
>> Hello,
>>
>> I'm new to cTAKES and was wondering what the options are for scaling out
>> the default clinical pipeline. I'm running it on a large number of clinical
>> notes using runClinicalPipeline.bat and specifying the input directory with
>> the notes. What are the best options for doing this in a more scalable way?
>> For example, can I parallelize it with UIMA-AS? Or should I manually use
>> multiple command prompts to run the clinical pipeline on different sets of
>> clinical notes in parallel? I'm not sure if there is any built-in solution
>> or community resource which uses EMR/Spark or some other method to achieve
>> this.
>>
>> Thank you for your help.
>>
>
> --
> Greg M. Silverman
> Senior Systems Developer
> NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
> Department of Surgery
> University of Minnesota
> [email protected]
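
For readers new to horizontal pod autoscaling, which this thread relies on for scaling the cTAKES backends, here is a minimal Python sketch of the scaling rule described in the Kubernetes documentation: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up, with no change when the ratio is within a small tolerance of 1.0. This is not code from NLP-ADAPT-kube. The function name, the replica bounds, and the utilization figures in the usage lines are all hypothetical, and the real autoscaler adds further behavior (stabilization windows, pod readiness handling) that is left out here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=12, tolerance=0.1):
    """Sketch of the documented HPA rule:
    desired = ceil(current_replicas * current_metric / target_metric).

    All parameter names and default bounds are illustrative assumptions.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Skip scaling when the observed metric is close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)

    # Keep the result inside the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical numbers: 6 backends at 90% average CPU, 50% target.
    print(desired_replicas(6, 90, 50))    # 11
    # Within tolerance of the target, so the replica count is unchanged.
    print(desired_replicas(6, 52, 50))    # 6
    # Load drops, so the deployment scales back down.
    print(desired_replicas(12, 20, 50))   # 5
```

Because the rule is proportional, a sustained rise in load adds backends until the per-pod metric returns to the target or the upper bound is reached, which is what makes scaling the number of instances up or down largely hands-off.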
