On 05/12/13 10:04, Jens Grivolla wrote:
> I agree that it might make more sense to model our needs more directly
> instead of trying to squeeze it into the schema we normally use for text
> processing. But at the same time I would of course like to avoid having
> to reimplement many of the things that are already available when using
> AnnotationBase.
>
> For the cross-view indexing issue I was thinking of creating individual
> views for each modality and then a merged view that just contains a
> subset of annotations of each view, and on which we would do the
> cross-modal reasoning.
>
> I just looked again at the GaleMultiModalExample (not much there,
> unfortunately) and saw that e.g. AudioSpan derives from AnnotationBase
> but still has float values for begin/end. I would be really interested
> in learning more about what was done in GALE, but it's hard to find any
> relevant information...

The readme at http://svn.apache.org/repos/asf/uima/sandbox/trunk/GaleMultiModalExample/README.txt points to two papers with more details on the GALE multi-modal application.
A portion of the view model was like this:

Audio view - sofaRef to the audio data, which was passed in parallel to
multiple ASR annotators. Each ASR annotator put its transcription in the
view, where annotations contained ASR engine IDs.

Transcription views - a text sofa with the transcription from one ASR
output. Annotations for each word referenced the lexeme annotations in
the audio view. Multiple MT annotators would receive each transcription
view and add their translations in the view.

Translation views - a text sofa with one of the translations, based on a
combination of ASR engine and MT engine. Annotations in a translation
view referenced the annotations in a transcription view.

There were more views. The points here are that 1) views were designed
to hold a particular SOFA to be processed by analytics appropriate for
that modality, 2) each derived view had cross references to the
annotations in the views it was derived from, and 3) at the end, the GUI
presenting the final translation could, for any word(s), show the
particular piece of transcription it came from, and/or play the
associated audio segment.
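For concreteness, here is a minimal sketch of that cross-view reference
pattern using the JCas API. The AudioSpan and TranscribedWord cover
classes, their features (float begin/end, an ASR engine ID, a "source"
reference) and the view names are assumptions standing in for whatever
type system an application would actually define; only createView,
setSofaDataURI, setDocumentText and addToIndexes are stock UIMA calls.

// Sketch only: AudioSpan and TranscribedWord are assumed JCas cover
// classes generated from a hypothetical type system (AudioSpan derives
// from AnnotationBase with float begin/end; TranscribedWord is an
// Annotation with a "source" feature referencing an AudioSpan).

import org.apache.uima.jcas.JCas;

public class CrossViewSketch {

  public void buildViews(JCas jcas) throws Exception {
    // Audio view: the sofa is the audio data itself; an ASR annotator
    // adds lexeme annotations with time offsets and its engine ID.
    JCas audioView = jcas.createView("AudioView");
    audioView.setSofaDataURI("file:/data/broadcast-0001.wav", "audio/wav");

    AudioSpan lexeme = new AudioSpan(audioView);
    lexeme.setBegin(12.40f);
    lexeme.setEnd(12.85f);
    lexeme.setAsrEngineId("asr-1");
    lexeme.addToIndexes();

    // Transcription view: a text sofa holding one ASR engine's output;
    // each word annotation points back to the lexeme it came from.
    JCas transcriptionView = jcas.createView("TranscriptionView-asr-1");
    transcriptionView.setDocumentText("hello world ...");

    TranscribedWord word = new TranscribedWord(transcriptionView, 0, 5);
    word.setSource(lexeme);   // cross-view reference into the audio view
    word.addToIndexes();

    // A translation view would repeat the pattern, referencing words in
    // a transcription view, so a GUI can walk back from any translated
    // word to its transcription and ultimately to the audio segment.
  }
}

A merged view for cross-modal reasoning, along the lines Jens describes,
could then index a chosen subset of these same feature structures without
copying them, since a feature structure can be added to the indexes of
more than one view.

Eddie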