Thank you, Richard! I had not considered using a database to handle and
aggregate segment-level data, but that makes a lot of sense. UIMA is for
"Unstructured Information Management" after all; once it's structured, I
can use any number of other tools.
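
For the archives: the kind of thing I'm now picturing once the per-segment
numbers are sitting in a table (untested sketch; the JDBC URL and the
segment_stats table/columns are just placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SpeakerTotals {
  public static void main(String[] args) throws Exception {
    // Placeholder URL; any JDBC-accessible database would do.
    try (Connection conn = DriverManager.getConnection("jdbc:h2:./transcripts");
         Statement st = conn.createStatement();
         // Sum the per-segment token counts by speaker; adding the section
         // to the GROUP BY gives the per-topic breakdown instead.
         ResultSet rs = st.executeQuery(
             "SELECT speaker, SUM(tokens) AS total"
             + " FROM segment_stats GROUP BY speaker")) {
      while (rs.next()) {
        System.out.println(rs.getString("speaker") + "\t" + rs.getLong("total"));
      }
    }
  }
}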


Regards,
Matt

On Wed, Aug 26, 2015 at 10:51 AM, Richard Eckart de Castilho <r...@apache.org> wrote:

> Hi,
>
> I'd probably opt for approach 1. Adding provenance metadata to CASes
> or maintaining such data externally is a useful thing anyway. If you
> maintain such data in a database/index, you can quickly cut subcorpora
> as necessary and are flexible for future use-cases that might require
> differently cut subcorpora. If you also maintain certain statistics
> in your database, it allows you to query/aggregate faster than if you
> have to read all CASes with a certain property just to gather the
> statistics.
> Updating the DB whenever a change is made to a CAS (or a block of changes
> has been made) would be sufficient and could be handled by a dedicated
> component that you place at the end of all kinds of pipelines that you
> might run.
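>
> Sketch of what such a component could look like (untested; it uses uimaFIT
> and plain JDBC, and the Segment/Token types and the table layout are just
> placeholders for whatever you actually define in your type system):
>
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.PreparedStatement;
> import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
> import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
> import org.apache.uima.fit.descriptor.ConfigurationParameter;
> import org.apache.uima.fit.util.JCasUtil;
> import org.apache.uima.jcas.JCas;
>
> public class DbStatisticsWriter extends JCasAnnotator_ImplBase {
>
>   public static final String PARAM_JDBC_URL = "jdbcUrl";
>   @ConfigurationParameter(name = PARAM_JDBC_URL)
>   private String jdbcUrl;
>
>   @Override
>   public void process(JCas jcas) throws AnalysisEngineProcessException {
>     // One row per segment: document id, speaker, section, token count.
>     // Segment and Token stand for types from your own type system.
>     // (Opening a connection per CAS keeps the sketch short; in practice
>     // you would open it once in initialize() and reuse it.)
>     try (Connection conn = DriverManager.getConnection(jdbcUrl);
>          PreparedStatement ps = conn.prepareStatement(
>              "INSERT INTO segment_stats(doc_id, speaker, section, tokens)"
>              + " VALUES (?, ?, ?, ?)")) {
>       for (Segment seg : JCasUtil.select(jcas, Segment.class)) {
>         ps.setString(1, seg.getDocumentId());
>         ps.setString(2, seg.getSpeaker());
>         ps.setString(3, seg.getSection());
>         ps.setInt(4, JCasUtil.selectCovered(Token.class, seg).size());
>         ps.addBatch();
>       }
>       ps.executeBatch();
>     } catch (Exception e) {
>       throw new AnalysisEngineProcessException(e);
>     }
>   }
> }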
>
> Views would seem more appropriate if you cared about having one view
> for the transcription and another for the audio signal and want to
> annotate them independently / align them to each other.
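>
> E.g. (just a sketch; JCasFactory comes from uimaFIT, and the view names
> and file path are arbitrary):
>
> import org.apache.uima.fit.factory.JCasFactory;
> import org.apache.uima.jcas.JCas;
>
> public class TwoViewExample {
>   public static void main(String[] args) throws Exception {
>     // One CAS per call, with two views over the same artifact.
>     JCas cas = JCasFactory.createJCas();
>
>     JCas transcript = cas.createView("transcript");
>     transcript.setDocumentText("Good morning, everyone ...");
>
>     JCas audio = cas.createView("audio");
>     // Point the audio view at the signal instead of copying it into the CAS.
>     audio.setSofaDataURI("file:/data/calls/some-call.wav", "audio/wav");
>
>     // Annotations added to 'transcript' and 'audio' live in separate views
>     // but share the CAS, so alignment annotations can reference both.
>   }
> }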
>
> Cheers,
>
> -- Richard
>
> On 26.08.2015, at 16:45, Matthew DeAngelis <roni...@gmail.com> wrote:
>
> > Hello UIMA Gurus,
> >
> > I am relatively new to UIMA, so please excuse the general nature of my
> > question and any butchering of the terminology.
> >
> > I am attempting to write an application to process transcripts of
> > audio files. Each "raw" transcript is in its own HTML file with a
> > section listing biographical information for the speakers on the call,
> > followed by a number of sections containing transcriptions of the
> > discussion of different topics. I would like to be able to analyze
> > each speaker's contributions separately by topic and then aggregate
> > and compare these analyses between speakers and between each speaker
> > and the full text. I was thinking that I would break the document into
> > a new segment each time the speaker or the section of the document
> > changes (attaching relevant speaker metadata to each section), run
> > additional Analysis Engines on each segment (tokenizer, etc.), and
> > then arbitrarily recombine the results of the analysis by speaker,
> > etc.
> >
> > Looking through the documentation, I am considering two approaches:
> >
> > 1. Using a CAS Multiplier. Under this approach, I would follow the
> > example in Chapter 7 of the documentation, divide on section and
> > speaker demarcations, add metadata to each CAS, run additional AEs on
> > the CASes, and then use a multiplier to recombine the many CASes for
> > each document (one for the whole transcript, one for each section, one
> > for each speaker, etc.). The advantage of this approach is that it
> > seems easy to incorporate into a pipeline of AEs, since they are
> > designed to run on each CAS. The disadvantage is that it seems
> > unwieldy to have to keep track of all of the related CASes per
> > document and aggregate statistics across the CASes.
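> >
> > Roughly what I have in mind for the splitting side (untested sketch
> > using uimaFIT; Segment and SegmentInfo stand in for my own annotation
> > types):
> >
> > import java.util.Iterator;
> > import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
> > import org.apache.uima.cas.AbstractCas;
> > import org.apache.uima.fit.component.JCasMultiplier_ImplBase;
> > import org.apache.uima.fit.util.JCasUtil;
> > import org.apache.uima.jcas.JCas;
> >
> > public class SegmentSplitter extends JCasMultiplier_ImplBase {
> >
> >   private Iterator<Segment> segments;
> >
> >   @Override
> >   public void process(JCas jcas) throws AnalysisEngineProcessException {
> >     // One Segment per speaker turn / section, produced by an earlier AE.
> >     segments = JCasUtil.select(jcas, Segment.class).iterator();
> >   }
> >
> >   @Override
> >   public boolean hasNext() {
> >     return segments.hasNext();
> >   }
> >
> >   @Override
> >   public AbstractCas next() throws AnalysisEngineProcessException {
> >     Segment seg = segments.next();
> >     JCas out = getEmptyJCas();
> >     out.setDocumentText(seg.getCoveredText());
> >     // Carry the provenance along so downstream AEs know which document,
> >     // section, and speaker this piece came from.
> >     SegmentInfo info = new SegmentInfo(out);
> >     info.setDocumentId(seg.getDocumentId());
> >     info.setSpeaker(seg.getSpeaker());
> >     info.setSection(seg.getSection());
> >     info.addToIndexes();
> >     return out;
> >   }
> > }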
> >
> > 2. Using CAS Views. This option is appealing because it seems like CAS
> > Views were designed for associating many different aspects of the same
> > document with one another. However, it looks to me as though I would
> > have to specify different views both when parsing the document into
> > sections and when passing them through subsequent AEs, which would
> > make it harder to drop into an existing pipeline. I may be
> > misunderstanding how subsequent AEs work with Views, though.
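> >
> > If I understand the sofa-mapping machinery correctly, reusing an
> > ordinary (view-unaware) AE on a particular view would look roughly
> > like this (uimaFIT sketch; MyTokenizer is a stand-in for whatever AE
> > I reuse):
> >
> > import org.apache.uima.analysis_engine.AnalysisEngineDescription;
> > import org.apache.uima.cas.CAS;
> > import org.apache.uima.fit.factory.AggregateBuilder;
> > import org.apache.uima.fit.factory.AnalysisEngineFactory;
> >
> > public class ViewPipeline {
> >   public static AnalysisEngineDescription build() throws Exception {
> >     AnalysisEngineDescription tokenizer =
> >         AnalysisEngineFactory.createEngineDescription(MyTokenizer.class);
> >
> >     // Run the same tokenizer once per view by mapping its default view
> >     // onto the named views of the aggregate.
> >     AggregateBuilder builder = new AggregateBuilder();
> >     builder.add(tokenizer, CAS.NAME_DEFAULT_SOFA, "speakerBio");
> >     builder.add(tokenizer, CAS.NAME_DEFAULT_SOFA, "discussion");
> >     return builder.createAggregateDescription();
> >   }
> > }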
> >
> > For those more experienced with UIMA, how would you approach this
> > problem? It's entirely possible that I am missing a third (fourth,
> > fifth...) approach that would work better than either of those above,
> > so any guidance would be much appreciated.
> >
> >
> > Regards and thanks,
> > Matt
>
>
