Hi!

On Sun, Sep 06, 2015 at 10:58:44AM -0400, Eddie Epstein wrote:
> On Sun, Sep 6, 2015 at 10:11 AM, Petr Baudis <pa...@ucw.cz> wrote:
> > (ii) Use an internal "intermediary" CAS instance in process() to which
> > I append my sentences, then use it as a source of output CASes.  Turns
> > out (surprisingly) that I can't append to a sofa documenttext ("Data for
> > Sofa feature setLocalSofaData() has already been set." - not sure about
> > the reason for this restriction).
> >
>
> The Sofa data for a view is immutable, otherwise existing annotations
> could become invalid.

But in my case, I'd only append to the end, so this concern is moot.
It's rather easy anyway to make your annotations go invalid if you use
CasCopier a bit.

> > I think the only choice except downright unmaintainable hacks (like
> > programmatically generated M views) is to just give up on preserving my
> > annotations and carry over just the sentence texts.  Am I missing
> > something?
> >
>
> Creating a new view in the intermediate CAS for each of the N input CASes
> would work. A new output CAS Sofa would be comprised of data from
> multiple views and of course the annotation end points adjusted as when
> added to the new output CAS.

I guess that .getViewIterator() would make this not so frustrating, so
I'll try this route, thanks for the tip!

> One problem there is that the intermediate CAS would continue to grow
> in size, so there would need to be some point when it could be reset.

Indeed; the moment when all M CASes have been output seems like a good
point for that.  I assume .release() would accomplish this.

> > (I'm somewhat tempted to cut my losses short (much too late) and
> > abandon UIMA flow control altogether, using only simple pipelines and
> > having custom glue code to connect these together, as it seems like
> > getting the flow to work in interesting cases is a huge time sink and in
> > retrospect, it could never pay off any abstract advantage of easier
> > distributed processing (where you probably end up having to chop up the
> > pipeline manually anyway).  I would probably never recommend new UIMA
> > users to strive for a single pipeline with CAS multipliers/mergers and
> > begin to consider these features an evolutionary dead end rather than
> > advantageous.  Not sure if there even *are* any other real users using
> > advanced flows besides me and DeepQA.  I'll be glad to hear any opinions
> > on this!)
> >
>
> Definitely the advantage to encapsulating analytics in standard UIMA
> components is easy scalability via the vertical and horizontal scale out
> options offered by UIMA-AS and DUCC. Flexibility in chopping up a
> pipeline into services as needed is another advantage.

But as far as I understand, you need to explicitly define and deploy AEs
that are to be run on different machines anyway.  So I'm not sure if the
extra value is really that large in the end?

> The previously mentioned GALE multimodal application also converted
> sequences of N input CASes to M output CASes. In that case the input
> CASes represented 2 minutes worth of speech-to-text transcription of
> broadcast news, and each output CAS represented a single news story.
> The story-CASes then went thru a pipeline that identified the story and
> updated a pre-existing summarization for each story.

Interesting (and good to hear), thanks!

-- 
				Petr Baudis
	If you have good ideas, good data and fast computers,
	you can do almost anything. -- Geoffrey Hinton
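
P.S. For reference, the view-per-input-CAS merge suggested above could be
sketched roughly as below. This is only an illustration of the offset
arithmetic, not UIMA code: Span, ViewData and ViewMergeSketch are
hypothetical plain-Java stand-ins for annotations, views and the merging
step, and the single-space separator is an arbitrary assumption. In a real
component the text would come from the views of the intermediate CAS (e.g.
via getViewIterator()), and the merged text would be set once on a fresh
output CAS, since Sofa data cannot be appended to.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an annotation: a type name plus begin/end offsets.
final class Span {
    final String type;
    final int begin;
    final int end;

    Span(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

// Hypothetical stand-in for one view: its text and the spans over that text.
final class ViewData {
    final String text;
    final List<Span> spans;

    ViewData(String text, List<Span> spans) {
        this.text = text;
        this.spans = spans;
    }
}

final class ViewMergeSketch {
    // Concatenate the view texts and shift every span by the position at
    // which its view's text starts in the merged text.
    static ViewData merge(List<ViewData> views, String separator) {
        StringBuilder text = new StringBuilder();
        List<Span> merged = new ArrayList<>();
        for (ViewData view : views) {
            if (text.length() > 0) {
                text.append(separator);
            }
            int offset = text.length(); // start of this view in the output
            text.append(view.text);
            for (Span s : view.spans) {
                merged.add(new Span(s.type, s.begin + offset, s.end + offset));
            }
        }
        return new ViewData(text.toString(), merged);
    }

    public static void main(String[] args) {
        ViewData a = new ViewData("First sentence.",
                List.of(new Span("Sentence", 0, 15)));
        ViewData b = new ViewData("Second one.",
                List.of(new Span("Sentence", 0, 11)));
        ViewData out = merge(List.of(a, b), " ");
        for (Span s : out.spans) {
            // Each shifted span should still cover its original sentence.
            System.out.println(s.type + " [" + s.begin + "," + s.end + ") "
                    + out.text.substring(s.begin, s.end));
        }
    }
}
```

Because the merged text is built in one pass before any span is attached,
nothing ever has to be appended to an existing document text; the
intermediate views can be dropped once all output has been produced.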