Hi Sarma,

I encountered the same issue(s) with LVG when running multiple threads in the same JVM process. We've been scaling out by spawning multiple pipelines in separate processes. However, it would be interesting to identify which components are not thread-safe and take advantage of running multiple components in the same process.

Another area for optimization, as you pointed out, is the memory footprint. It would be good if someone has a chance to profile the memory usage and see if we could lower the footprint; my initial hunch is that all of the models are loaded into memory as a cache.

If you're interested, feel free to open a Jira so it can be tracked and you can get credit for the contributions.

-Pei
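As a possible starting point for the memory question, here is a minimal sketch (not cTAKES code) that measures approximate heap growth around a single load step using the standard java.lang.management API. The names HeapProbe, measure, and the byte array standing in for a model are hypothetical and only for illustration. The gc() call is a request rather than a guarantee, so the numbers are rough; a heap dump or a real profiler would give a much more precise picture.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.function.Supplier;

final class HeapProbe {

    /** The loaded object plus the approximate heap growth observed while loading it. */
    static final class Result<T> {
        final T value;
        final long deltaBytes;

        Result(T value, long deltaBytes) {
            this.value = value;
            this.deltaBytes = deltaBytes;
        }
    }

    /** Requests a GC and returns the heap bytes currently in use (approximate). */
    static long usedHeapBytes() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        bean.gc(); // a request only; the JVM may collect more or less than expected
        return bean.getHeapMemoryUsage().getUsed();
    }

    /** Runs the loader and reports how much the used heap changed across the call. */
    static <T> Result<T> measure(Supplier<T> loader) {
        long before = usedHeapBytes();
        T value = loader.get(); // the result is kept referenced so it stays on the heap
        long after = usedHeapBytes();
        return new Result<>(value, after - before);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for loading a model or dictionary into memory.
        Result<byte[]> r = measure(() -> new byte[32 * 1024 * 1024]);
        System.out.printf("approx. heap growth: %d MB%n", r.deltaBytes / (1024 * 1024));
    }
}
```

Wrapping each resource-loading step this way, one at a time, would at least show which step accounts for most of the footprint.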
On Jan 16, 2013, at 5:49 PM, "Karthik Sarma" <[email protected]> wrote:

> Hi folks,
>
> I know that the official position is that cTAKES is not thread-safe. I'm
> wondering, however, if anyone has looked into using multiple processing
> pipelines (via the processingUnitThreadCount directive in a CPE descriptor)
> and documenting where the thread safety problems lie.
>
> I've given it a bit of a try, and on first glance the biggest issue seems
> to be in the LVG api, which isn't at all thread-safe (they seem to claim
> that it would be thread-safe so long as API instances are not shared, but
> that doesn't seem prima facie true since it throws errors when multiple
> pipelines are used, which *should* be creating multiple LVG api instances).
>
> I haven't found any other serious issues, but perhaps you folks might be
> familiar with some.
>
> There is, of course, the memory issue -- cTAKES' memory footprint alone on
> my machine with a single pipeline and using a mysql umls database is over
> 2GB; this is presumably the cost of each pipeline, though I can't actually
> really figure out what all that memory is being used for since none of the
> in-memory DBs and indexes used seem to be anywhere near that size.
>
> It is, of course, possible to split datasets and simply run multiple
> processes, but my feeling is that there must be a lot of unnecessary
> overhead there since all the operations we actually do (other than the CAS
> consumers) are read-only. It seems to me that cTAKES ought to be limited
> only by disk/memory throughput and total CPU capacity because of the nature
> of the load...
>
> Anyway, if anyone else has thoughts, I'd be interested. This is something
> I'd be interested in taking a stab at resolving, since I've been poking
> around in this direction behind the scenes for some time now. My group has
> access to huge databases but limited computational resources, and I'd like
> to make the most of what we've got!
>
> Karthik
>
>
> --
> Karthik Sarma
> UCLA Medical Scientist Training Program Class of 20??
> Member, UCLA Medical Imaging & Informatics Lab
> Member, CA Delegation to the House of Delegates of the American Medical
> Association
> [email protected]
> gchat: [email protected]
> linkedin: www.linkedin.com/in/ksarma
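
On the LVG point in the quoted message, here is a rough sketch of two common ways to isolate a component that is not thread-safe. It does not use the real LVG API: ScratchNormalizer is a hypothetical stand-in with some mutable internal state, and Confinement, perThread, serialized, and runAll are names made up for this example. Option 1 gives each thread its own instance, which only helps if instances share no static state. Since the errors described above appear even with separate instances, Option 2 (one lock around every call) may be the safer fallback, at the cost of serializing that step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.function.Supplier;

final class Confinement {

    /** Hypothetical stand-in for a stateful, non-thread-safe component; NOT the real LVG API. */
    static final class ScratchNormalizer {
        private final StringBuilder scratch = new StringBuilder(); // mutable state, unsafe to share

        String normalize(String term) {
            scratch.setLength(0);
            scratch.append(term.trim().toLowerCase(Locale.ROOT));
            return scratch.toString();
        }
    }

    /** Option 1: one instance per thread. Only helps if instances share no static state. */
    static Function<String, String> perThread(Supplier<ScratchNormalizer> factory) {
        ThreadLocal<ScratchNormalizer> local = ThreadLocal.withInitial(factory);
        return term -> local.get().normalize(term);
    }

    /** Option 2: a single instance behind one lock. Safe, but this step runs one call at a time. */
    static Function<String, String> serialized(ScratchNormalizer shared) {
        Object lock = new Object();
        return term -> {
            synchronized (lock) {
                return shared.normalize(term);
            }
        };
    }

    /** Applies the normalizer to every term on a fixed pool; results come back in input order. */
    static List<String> runAll(Function<String, String> normalizer, List<String> terms, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String term : terms) {
                futures.add(pool.submit(() -> normalizer.apply(term)));
            }
            List<String> out = new ArrayList<>();
            for (Future<String> f : futures) {
                out.add(f.get());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> terms = List.of(" Heart ", "LUNG", " Kidney");
        System.out.println(runAll(perThread(ScratchNormalizer::new), terms, 4));
        System.out.println(runAll(serialized(new ScratchNormalizer()), terms, 4));
    }
}
```

If only one annotator turns out to be unsafe, locking just that step would still let the rest of each pipeline run in parallel within one process.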
