A point of clarification: Almost everything we get from the SHARP human annotations is associated with a span of text by the annotators. And we need to recover those spans of text with our machine learning models. So in most cases, we need subtypes of Annotation, not subtypes of TOP. This is perhaps the biggest issue with the current type system: the TOP subtypes contain most of what we need, but the Annotation subtypes are often too impoverished to capture the SHARP annotations.
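To make the TOP versus Annotation distinction concrete, here is a minimal sketch in plain Java. It does not use the real UIMA or cTAKES classes: `Top`, `SpanAnnotation`, `LabElement`, and `LabMention` are hypothetical stand-ins, and the only fact borrowed from UIMA is that an Annotation is a TOP with begin and end offsets added. The sample sentence is invented for illustration.

```java
public class SpanSketch {
    // Stand-in for a UIMA TOP: a feature structure with no text offsets.
    static class Top {}

    // Stand-in for a UIMA Annotation: a TOP plus begin/end character offsets.
    static class SpanAnnotation extends Top {
        final int begin;
        final int end;

        SpanAnnotation(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        // Recover the annotated text from the document.
        String coveredText(String doc) {
            return doc.substring(begin, end);
        }
    }

    // Hypothetical refsem-style element: normalized content, but no span.
    static class LabElement extends Top {
        final String normalizedName;

        LabElement(String normalizedName) {
            this.normalizedName = normalizedName;
        }
    }

    // Hypothetical textsem-style mention: has a span and points at the element.
    static class LabMention extends SpanAnnotation {
        final LabElement element;

        LabMention(int begin, int end, LabElement element) {
            super(begin, end);
            this.element = element;
        }
    }

    // Build a mention for the word "hemoglobin" in the given document.
    static LabMention example(String doc) {
        String word = "hemoglobin";
        int begin = doc.indexOf(word);
        return new LabMention(begin, begin + word.length(), new LabElement("Hemoglobin"));
    }

    public static void main(String[] args) {
        String doc = "Patient's hemoglobin was 9.2 g/dL.";
        LabMention m = example(doc);
        System.out.println(m.coveredText(doc) + " -> " + m.element.normalizedName);
    }
}
```

A model that has to reproduce the annotators' spans can be trained and scored against something like `LabMention`, but not against a bare `LabElement`, which has nowhere to record where in the text it was found.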
On Nov 26, 2012, at 9:28 PM, "Wu, Stephen T., Ph.D." <[email protected]> wrote:

>> * I couldn't find an entity type for "Clinical_attribute", "Devices", "Lab",
>> "Phenomena"
> "Devices" and "Phenomena" don't exist yet because they were not part of the
> CEM models. I need input from someone on CEMs if we're to add these.
>
> "Clinical_attribute" -- is this what you're looking for:
> org.apache.ctakes.typesystem.type.refsem.Attribute
> It inherits from Element.

But Attribute is a TOP and we need an Annotation here. (An added concern is, does it really make sense to have a raw Attribute, and not some specific sub-type like BodyLaterality or BodySide?)

> Lab should be at org.apache.ctakes.typesystem.type.refsem.Lab

But Lab is a TOP, and we need an Annotation here.

>> * I couldn't find a modifier type (or alternatively, an Annotation subclass)
>> for the Knowtator annotations "generic_class", "conditional_class",
>> "uncertainty_indicator_class", "distal_or_proximal", "Person",
>> "negation_indicator_class", "historyOf_indicator_class",
>> "superior_or_inferior", "medial_or_lateral", "dorsal_or_ventral",
>> "method_class", "device_class", "allergy_indicator_class", "Route", "Form",
>> "Strength", "Strength number", "Strength unit", "Frequency", "Frequency
>> number", "Frequency unit", "Value", "Value number", "Value unit",
>> "estimated_flag_indicator", "reference_range", "Date", "Status change",
>> "Duration", "Dosage".
> Use the type org.apache.ctakes.typesystem.type.textsem.Modifier with the
> "category" feature.

Should there be constants for each of these categories?

>> * I couldn't find a place for the normalized value of
> "generic_class", --> IdentifiedAnnotation:generic
> "conditional_class", --> IdentifiedAnnotation:conditional
> "uncertainty_indicator_class", --> IdentifiedAnnotation:uncertainty
> "negation_indicator_class", --> IdentifiedAnnotation:polarity

Ok.
> "distal_or_proximal", --> BodyLaterality:value
> "superior_or_inferior", --> BodyLaterality:value
> "dorsal_or_ventral", --> BodyLaterality:value
> "medial_or_lateral", --> BodyLaterality:value
> "device_class", --> ProcedureDevice:value

And then set the Modifier.normalizedForm to BodyLaterality or ProcedureDevice? Ok.

> "Person", --> Entity

But Entity is a TOP, not an Annotation.

>> After working with this data I think we should consider having separate UIMA
>> Annotation sub-types for each of the things that are Modifiers now. For
>> example, if we have a real Severity Annotation for textual mentions of
>> severity, then the CAS makes it easy to select these. We have exactly this use
>> case in relation extractor - we need just the Severity modifiers, excluding
>> all the other modifiers. Basically, I think the principle we should follow in
>> UIMA is:
>>
>> "If you could imagine searching the CAS for something, then that something
>> should have its own Annotation sub-type."
>>
> It's a good point, and a relatively good principle, but we have decided
> against it in the past. The reason is a countering principle:
>
> "Do not put locally used (component-specific) types in the CAS."

This principle is not relevant here. The types we're talking about are not used locally within a single AnalysisEngine. They're read in from the SHARPKnowtatorXMLReader AnalysisEngine, and used separately in the ModifierExtractorAnnotator AnalysisEngine, the DegreeOfRelationExtractorAnnotator AnalysisEngine, EventAnnotator AnalysisEngine, TimeAnnotator AnalysisEngine, etc. So they can't be local to a single AnalysisEngine, and they must be in the CAS.

> There is no garbage collection in UIMA (despite things being deleted from
> the index) and extra types will bloat the CAS system, though admittedly is
> not too terrible a bloating.

I don't see how garbage collection is relevant here. We're going to create exactly the same number of Modifiers.
It's just whether we create them as raw Modifiers or Modifier sub-types. Are you saying there's some significant extra cost to having extra types, even when the total number of instances across all types is constant?

> Two doubts that could change my mind:
> 1) Do we envision evaluation of the Modifiers/attributes -- apart from the
> Named Entities they're attached to? If so, we need to preserve this
> information right at the beginning.

That's exactly what I'm talking about with the severity modifiers. We have a severity modifier extraction annotator, and we *do* need to evaluate its performance by comparing the severity modifiers it extracts to those in the annotated data. (We need this annotator, just like we need the UMLS entity annotator, so that our relation extraction annotator can find relations between severities and UMLS entities.)

The same is essentially true for everything annotated in SHARP. It's all annotated with the intention of training machine learning models to reproduce those annotations. So we really do want everything that's in the Knowtator XML annotations to be loaded and accessible to all our UIMA AnalysisEngines.

> 2) Will these modifiers be reusable downstream?

I'm not sure what you mean here. Are you suggesting that the type system should only have types for things that external users of cTAKES might need, and that we shouldn't have types for things that must be passed between different cTAKES AnalysisEngines? If that's the case, I think this would be a step in a very wrong direction. In UIMA, anything that has to be passed between AnalysisEngines should be declared in the type system. And the whole point of having a type system is to ease the passing of this information. So hobbling the types that we pass between cTAKES annotators just to reduce the size of the type system for external users just doesn't make sense.

Steve
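As a rough illustration of the sub-type argument above, the following plain-Java sketch contrasts selecting by type with filtering raw Modifiers on a category string. Nothing here is the real cTAKES type system or the UIMA index API: `Modifier`, `SeverityModifier`, `select`, `selectByCategory`, `buildIndex`, and the `"severity"` category string are all hypothetical, and the list stands in for the CAS index. The toy `select` only mimics the kind of type-based lookup that uimaFIT's `JCasUtil.select` offers.

```java
import java.util.ArrayList;
import java.util.List;

public class SelectSketch {
    // Hypothetical raw modifier: a span plus a free-form category string.
    static class Modifier {
        final int begin;
        final int end;
        final String category;

        Modifier(int begin, int end, String category) {
            this.begin = begin;
            this.end = end;
            this.category = category;
        }
    }

    // Hypothetical dedicated sub-type for severity mentions.
    static class SeverityModifier extends Modifier {
        SeverityModifier(int begin, int end) {
            super(begin, end, "severity");
        }
    }

    // Toy typed lookup: return every indexed item that is an instance of the type.
    static <T extends Modifier> List<T> select(List<Modifier> index, Class<T> type) {
        List<T> out = new ArrayList<>();
        for (Modifier m : index) {
            if (type.isInstance(m)) {
                out.add(type.cast(m));
            }
        }
        return out;
    }

    // The raw-Modifier alternative: filter on an exact category string.
    static List<Modifier> selectByCategory(List<Modifier> index, String category) {
        List<Modifier> out = new ArrayList<>();
        for (Modifier m : index) {
            if (m.category.equals(category)) {
                out.add(m);
            }
        }
        return out;
    }

    // Three instances in total, whichever way they are typed.
    static List<Modifier> buildIndex() {
        List<Modifier> index = new ArrayList<>();
        index.add(new SeverityModifier(0, 4));
        index.add(new Modifier(10, 12, "negation_indicator_class"));
        index.add(new SeverityModifier(20, 28));
        return index;
    }

    public static void main(String[] args) {
        List<Modifier> index = buildIndex();
        System.out.println("instances: " + index.size());
        System.out.println("by type: " + select(index, SeverityModifier.class).size());
        System.out.println("by string: " + selectByCategory(index, "severity").size());
        // A misspelled or differently cased string silently matches nothing.
        System.out.println("by wrong string: " + selectByCategory(index, "Severity").size());
    }
}
```

The number of instances is the same in both designs. The difference is that the typed lookup is checked by the compiler, while the string lookup depends on every reader and annotator agreeing on the exact category spelling, which is why constants would be needed if the category feature is kept.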
