Thank you, Richard! I had not considered using a database to handle and aggregate segment-level data, but that makes a lot of sense. UIMA is for "Unstructured Information Management" after all; once it's structured, I can use any number of other tools.
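A minimal sketch of the database idea, written in Python with the standard-library sqlite3 module rather than as a UIMA component. Everything named here is an assumption for illustration only: the `segment` table, its columns, the helper functions, and the sample document ID and speakers do not come from UIMA or from this thread. In a real pipeline, the `record_segment` call would be made by whatever component runs last and has access to the segment's metadata.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the segment store. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS segment (
               doc_id      TEXT    NOT NULL,  -- which transcript the segment came from
               seg_no      INTEGER NOT NULL,  -- position of the segment in the transcript
               speaker     TEXT    NOT NULL,
               topic       TEXT    NOT NULL,
               token_count INTEGER NOT NULL,  -- one example statistic per segment
               PRIMARY KEY (doc_id, seg_no)
           )"""
    )
    return conn


def record_segment(conn, doc_id, seg_no, speaker, topic, token_count):
    """Insert or overwrite one segment's row, so re-running a pipeline
    on the same segment updates its statistics instead of duplicating them."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO segment VALUES (?, ?, ?, ?, ?)",
            (doc_id, seg_no, speaker, topic, token_count),
        )


def tokens_by_speaker(conn, doc_id):
    """Aggregate the per-segment statistic by speaker for one transcript."""
    rows = conn.execute(
        "SELECT speaker, SUM(token_count) FROM segment "
        "WHERE doc_id = ? GROUP BY speaker ORDER BY speaker",
        (doc_id,),
    )
    return dict(rows.fetchall())


def tokens_by_speaker_and_topic(conn, doc_id):
    """Same aggregation, cut by (speaker, topic) instead of speaker alone."""
    rows = conn.execute(
        "SELECT speaker, topic, SUM(token_count) FROM segment "
        "WHERE doc_id = ? GROUP BY speaker, topic ORDER BY speaker, topic",
        (doc_id,),
    )
    return {(speaker, topic): total for speaker, topic, total in rows}


if __name__ == "__main__":
    conn = open_store()
    # Made-up sample data: three segments from one transcript.
    record_segment(conn, "call-001", 0, "Alice", "Results", 120)
    record_segment(conn, "call-001", 1, "Bob", "Results", 80)
    record_segment(conn, "call-001", 2, "Alice", "Outlook", 50)
    print(tokens_by_speaker(conn, "call-001"))
    print(tokens_by_speaker_and_topic(conn, "call-001"))
```

Once each segment is a row, recombining by speaker, by topic, or by both is just a different GROUP BY, and no CASes need to be re-read to gather the statistics.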
Regards,
Matt

On Wed, Aug 26, 2015 at 10:51 AM, Richard Eckart de Castilho <r...@apache.org> wrote:

> Hi,
>
> I'd probably opt for approach 1. Adding provenance metadata to CASes
> or maintaining such data externally is a useful thing anyway. If you
> maintain such data in a database/index, you can quickly cut subcorpora
> as necessary and are flexible for future use-cases that might require
> differently cut subcorpora. If you also maintain certain statistics
> in your database, it allows you to query/aggregate faster than if you
> have to read all CASes with a certain property just to gather the
> statistics.
> Updating the DB whenever a change is made to a CAS (or a block of changes
> has been made) would be sufficient and could be handled by a dedicated
> component that you place at the end of all kinds of pipelines that you
> might run.
>
> Views would seem more appropriate if you cared about having one view
> for the transcription and another for the audio signal and want to
> annotate them independently / align them to each other.
>
> Cheers,
>
> -- Richard
>
> On 26.08.2015, at 16:45, Matthew DeAngelis <roni...@gmail.com> wrote:
>
> > Hello UIMA Gurus,
> >
> > I am relatively new to UIMA, so please excuse the general nature of my
> > question and any butchering of the terminology.
> >
> > I am attempting to write an application to process transcripts of audio
> > files. Each "raw" transcript is in its own HTML file with a section listing
> > biographical information for the speakers on the call followed by a number
> > of sections containing transcriptions of the discussion of different
> > topics. I would like to be able to analyze each speaker's contributions
> > separately by topic and then aggregate and compare these analyses between
> > speakers and between each speaker and the full text. I was thinking that I
> > would break the document into a new segment each time the speaker or the
> > section of the document changes (attaching relevant speaker metadata to
> > each section), run additional Analysis Engines on each segment (tokenizer,
> > etc.), and then arbitrarily recombine the results of the analysis by
> > speaker, etc.
> >
> > Looking through the documentation, I am considering two approaches:
> >
> > 1. Using a CAS Multiplier. Under this approach, I would follow the example
> > in Chapter 7 of the documentation, divide on section and speaker
> > demarcations, add metadata to each CAS, run additional AEs on the CASes,
> > and then use a multiplier to recombine the many CASes for each document
> > (one for the whole transcript, one for each section, one for each speaker,
> > etc.). The advantage of this approach is that it seems easy to incorporate
> > into a pipeline of AEs, since they are designed to run on each CAS. The
> > disadvantage is that it seems unwieldy to have to keep track of all of the
> > related CASes per document and aggregate statistics across the CASes.
> >
> > 2. Use CAS Views. This option is appealing because it seems like CAS Views
> > were designed for associating many different aspects of the same document
> > with one another. However, it looks to me that I would have to specify
> > different views both when parsing the document into sections and when
> > passing them through subsequent AEs, which would make it harder to drop
> > into an existing pipeline. I may be misunderstanding how subsequent AEs
> > work with Views, however.
> >
> > For those more experienced with UIMA, how would you approach this problem?
> > It's entirely possible that I am missing a third (fourth, fifth...)
> > approach that would work better than either of those above, so any guidance
> > would be much appreciated.
> >
> > Regards and thanks,
> > Matt