Re: type system changes needed to read SHARP data

Steven Bethard Mon, 26 Nov 2012 15:04:28 -0800

A point of clarification: Almost everything we get from the SHARP human 
annotations is associated with a span of text by the annotators. And we need to 
recover those spans of text with our machine learning models. So in most cases, 
we need subtypes of Annotation, not subtypes of TOP. This is perhaps the 
biggest issue with the current type system: the TOP subtypes contain most of 
what we need, but the Annotation subtypes are often too impoverished to capture 
the SHARP annotations.

On Nov 26, 2012, at 9:28 PM, "Wu, Stephen T., Ph.D." <[email protected]> 
wrote:
>> * I couldn't find an entity type for "Clinical_attribute", "Devices", "Lab",
>> "Phenomena"
> "Devices" and "Phenomena" don't exist yet because they were not part of the
> CEM models.  I need input from someone on CEMs if we're to add these.
> 
> "Clinical_attribute" -- is this what you're looking for:
> org.apache.ctakes.typesystem.type.refsem.Attribute
> It inherits from Element.

But Attribute is a TOP and we need an Annotation here. (An added concern is, 
does it really make sense to have a raw Attribute, and not some specific 
sub-type like BodyLaterality or BodySide?)

> Lab should be at org.apache.ctakes.typesystem.type.refsem.Lab

But Lab is a TOP, and we need an Annotation here.

>> * I couldn't find a modifier type (or alternatively, an Annotation subclass)
>> for the Knowtator annotations "generic_class", "conditional_class",
>> "uncertainty_indicator_class", "distal_or_proximal", "Person",
>> "negation_indicator_class", "historyOf_indicator_class",
>> "superior_or_inferior", "medial_or_lateral", "dorsal_or_ventral",
>> "method_class", "device_class", "allergy_indicator_class", "Route", "Form",
>> "Strength", "Strength number", "Strength unit", "Frequency", "Frequency
>> number", "Frequency unit", "Value", "Value number", "Value unit",
>> "estimated_flag_indicator", "reference_range", "Date", "Status change",
>> "Duration", "Dosage".
> Use the type org.apache.ctakes.typesystem.type.textsem.Modifier with the
> "category" feature.

Should there be constants for each of these categories?
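
To make the question concrete, here is a minimal sketch of what such constants might look like. It assumes the Modifier "category" feature holds a string, which this thread does not confirm. The class name ModifierCategories and its members are hypothetical and not part of the cTAKES type system. Only a few of the Knowtator class names listed above are shown.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical holder class, NOT part of the cTAKES type system.
// Assumption: Modifier's "category" feature is a string.
final class ModifierCategories {
    private ModifierCategories() {}

    // Values copy a few of the Knowtator class names listed in this thread.
    static final String ROUTE = "Route";
    static final String FORM = "Form";
    static final String STRENGTH = "Strength";
    static final String FREQUENCY = "Frequency";
    static final String DURATION = "Duration";
    static final String DOSAGE = "Dosage";

    // Every category declared above, in declaration order.
    static final Set<String> ALL = Collections.unmodifiableSet(
            new LinkedHashSet<>(Arrays.asList(
                    ROUTE, FORM, STRENGTH, FREQUENCY, DURATION, DOSAGE)));

    // Case-sensitive check that a category string is one of the constants.
    static boolean isKnown(String category) {
        return ALL.contains(category);
    }
}
```

With constants like these, a reader and an annotator would at least agree on spelling, e.g. ModifierCategories.isKnown(ModifierCategories.ROUTE) holds while a misspelled or differently cased string is rejected.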

>> * I couldn't find a place for the normalized value of
> "generic_class", --> IdentifiedAnnotation:generic
> "conditional_class",  --> IdentifiedAnnotation:conditional
> "uncertainty_indicator_class", --> IdentifiedAnnotation:uncertainty
> "negation_indicator_class",  --> IdentifiedAnnotation:polarity

Ok.

> "distal_or_proximal", --> BodyLaterality:value
> "superior_or_inferior", --> BodyLaterality:value
> "dorsal_or_ventral", --> BodyLaterality:value
> "medial_or_lateral", --> BodyLaterality:value
> "device_class", --> ProcedureDevice:value

And then set the Modifier.normalizedForm to BodyLaterality or ProcedureDevice? 
Ok.

> "Person", --> Entity

But Entity is a TOP, not an Annotation.

>> After working with this data I think we should consider having separate UIMA
>> Annotation sub-types for each of the things that are Modifiers now. For
>> example, if we have a real Severity Annotation for textual mentions of
>> severity, then the CAS makes it easy to select these. We have exactly this
>> use case in relation extractor - we need just the Severity modifiers,
>> excluding all the other modifiers. Basically, I think the principle we
>> should follow in UIMA is:
>> 
>> "If you could imagine searching the CAS for something, then that something
>> should have its own Annotation sub-type."
>> 
> It's a good point, and a relatively good principle, but we have decided
> against it in the past.  The reason is a countering principle:
> 
> "Do not put locally used (component-specific) types in the CAS."

This principle is not relevant here. The types we're talking about are not used 
locally within a single AnalysisEngine. They're read in from the 
SHARPKnowtatorXMLReader AnalysisEngine, and used separately in the 
ModifierExtractorAnnotator AnalysisEngine, the 
DegreeOfRelationExtractorAnnotator AnalysisEngine, EventAnnotator 
AnalysisEngine, TimeAnnotator AnalysisEngine, etc. So they can't be local to a 
single AnalysisEngine, and they must be in the CAS.

> There is no garbage collection in UIMA (despite things being deleted from
> the index) and extra types will bloat the CAS system, though admittedly is
> not too terrible a bloating.

I don't see how garbage collection is relevant here. We're going to create 
exactly the same number of Modifiers. The only question is whether we create 
them as raw Modifiers or as Modifier sub-types. Are you saying there's some 
significant extra cost to having extra types, even when the total number of 
instances across all types is constant?

> Two doubts that could change my mind:
> 1) Do we envision evaluation of the Modifiers/attributes -- apart from the
> Named Entities they're attached to?  If so, we need to preserve this
> information right at the beginning.

That's exactly what I'm talking about with the severity modifiers. We have a 
severity modifier extraction annotator, and we *do* need to evaluate its 
performance by comparing the severity modifiers it extracts to those in the 
annotated data. (We need this annotator, just like we need the UMLS entity 
annotator, so that our relation extraction annotator can find relations between 
severities and UMLS entities.)
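
The difference between the two designs can be sketched in plain Java. This is only an illustration under stated assumptions: the classes below are hypothetical stand-ins, not the cTAKES types and not the UIMA API, and a plain List stands in for the CAS index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a UIMA Annotation: just a text span.
class SpanAnnotation {
    final int begin;
    final int end;

    SpanAnnotation(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }
}

// Design A: one generic modifier type, distinguished by a category string.
class CategoryModifier extends SpanAnnotation {
    final String category;

    CategoryModifier(int begin, int end, String category) {
        super(begin, end);
        this.category = category;
    }
}

// Design B: a dedicated sub-type per kind of modifier.
class SeverityModifier extends SpanAnnotation {
    SeverityModifier(int begin, int end) { super(begin, end); }
}

class RouteModifier extends SpanAnnotation {
    RouteModifier(int begin, int end) { super(begin, end); }
}

final class ModifierSelection {
    private ModifierSelection() {}

    // Design A: walk every annotation and compare category strings.
    static List<CategoryModifier> byCategory(
            List<? extends SpanAnnotation> index, String category) {
        List<CategoryModifier> out = new ArrayList<>();
        for (SpanAnnotation a : index) {
            if (a instanceof CategoryModifier
                    && ((CategoryModifier) a).category.equals(category)) {
                out.add((CategoryModifier) a);
            }
        }
        return out;
    }

    // Design B: select by type alone; no string comparison needed.
    static <T extends SpanAnnotation> List<T> byType(
            List<? extends SpanAnnotation> index, Class<T> type) {
        List<T> out = new ArrayList<>();
        for (SpanAnnotation a : index) {
            if (type.isInstance(a)) {
                out.add(type.cast(a));
            }
        }
        return out;
    }
}
```

In both designs the number of instances is the same. What changes is that with sub-types the evaluation code can ask for SeverityModifier.class directly, and a misspelled category string can no longer silently return nothing.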

The same is essentially true for everything annotated in SHARP. It's all 
annotated with the intention of training machine learning models to reproduce 
those annotations. So we really do want everything that's in the Knowtator XML 
annotations to be loaded and accessible to all our UIMA AnalysisEngines.

> 2) Will these modifiers be reusable downstream?

I'm not sure what you mean here. Are you suggesting that the type system should 
only have types for things that external users of cTAKES might need, and that 
we shouldn't have types for things that must be passed between different cTAKES 
AnalysisEngines?

If that's the case, I think this would be a step in a very wrong direction. In 
UIMA, anything that has to be passed between AnalysisEngines should be declared 
in the type system. And the whole point of having a type system is to ease the 
passing of this information. So hobbling the types that we pass between cTAKES 
annotators merely to reduce the size of the type system for external users 
doesn't make sense.

Steve
