Re: [jira] Commented: (UIMA-1033) ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component

Michael Tanenblatt Wed, 18 Jun 2008 07:25:58 -0700

My original message regarding this talks some about the dictionary format. I am in the process of writing a paper describing the whole of ConceptMapper, but that is not yet done. Here is what I wrote before:

The structure of the dictionary itself is quite flexible. Entries can have any number of variants (synonyms), and arbitrary features can be associated with dictionary entries. Individual variants inherit features from the parent token (i.e., the canonical form), but can override them or add additional features. In the following sample dictionary entry, there are 5 variants of the canonical form, and as described earlier, each inherits the SemClass and POS attributes from the canonical form, with the exception of the variant "mesenteric fibromatosis (c48.1)", which overrides the value of the SemClass attribute (this is a somewhat contrived example, just to make that point):

<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
  <variant base="abdominal fibromatosis" />
  <variant base="abdominal desmoid" />
  <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
  <variant base="mesenteric fibromatosis" />
  <variant base="retroperitoneal fibromatosis" />
</token>
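
To make the inheritance rule concrete, here is a minimal Python sketch of how the effective attributes of each variant could be computed from an entry like the one above. This is only an illustration of the rule as described, not ConceptMapper's actual code; the function name effective_attributes is made up for the example, and the entry is shortened to two variants.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample entry (two of the five variants).
ENTRY = """
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
  <variant base="abdominal fibromatosis" />
  <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
</token>
"""


def effective_attributes(token):
    """Yield (base, attributes) per variant; variant attributes win over the token's."""
    for variant in token.findall("variant"):
        merged = dict(token.attrib)    # inherited from the parent token
        merged.update(variant.attrib)  # overrides and additions from the variant
        base = merged.pop("base")      # "base" is the match string, not a feature
        yield base, merged


token = ET.fromstring(ENTRY.strip())
resolved = dict(effective_attributes(token))

for base, attributes in resolved.items():
    print(base, attributes)
```

The first variant keeps SemClass="Diagnosis" from the token, while the "(c48.1)" variant ends up with SemClass="Diagnosis-Site"; both keep POS="NN".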

So, testDict.xml is just an example. Two key AE descriptor parameters are "AttributeList" and "FeatureList", which provide the means to map from the XML attributes to the target annotation features. If your target annotation were called "DictTerm" and the DictTerm had the features "canonicalForm", "semanticClass" and "partOfSpeechTag", using the example dictionary snippet shown above, you would set AttributeList to:


        DictCanon
        SemClass
        POS

and you would set FeatureList to:

        canonicalForm
        semanticClass
        partOfSpeechTag

Then, when one of the variants is matched in the text, a new DictTerm would be created with its semanticClass set to the value of the SemClass attribute and its partOfSpeechTag set to the value of the POS attribute.
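
As a rough sketch of that mapping, the Python below pairs the two lists by position and copies over whichever attributes the matched variant carries. The positional pairing is my reading of the two lists above, and to_features is a hypothetical helper, not part of ConceptMapper. How DictCanon is resolved for the sample entry is not spelled out here, so the sketch simply leaves a feature unset when its attribute is absent.

```python
# Values of the two descriptor parameters from the example above.
ATTRIBUTE_LIST = ["DictCanon", "SemClass", "POS"]
FEATURE_LIST = ["canonicalForm", "semanticClass", "partOfSpeechTag"]


def to_features(attributes, attribute_list=ATTRIBUTE_LIST, feature_list=FEATURE_LIST):
    """Pair the two lists by position and copy over the attributes that are present."""
    return {
        feature: attributes[attribute]
        for attribute, feature in zip(attribute_list, feature_list)
        if attribute in attributes  # assumption: missing attributes leave the feature unset
    }


# Effective attributes of a matched variant from the sample entry.
matched = {"SemClass": "Diagnosis", "POS": "NN"}
dict_term = to_features(matched)
print(dict_term)
```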

One important point: matches are only performed against the strings given as values of the "variant" tag's "base" attribute. It is common practice for the "token" element to have a canonical form that is the same as one of the variants, but that is not required.
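
A small Python sketch of that point, using a made-up entry whose canonical form is not repeated as a variant. An exact whole-string lookup stands in for the real token-based matching here, and the lookup function is hypothetical; the only thing it is meant to show is that the index is built from the "base" values alone.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: the canonical form is NOT listed among the variants.
ENTRY = """
<token canonical="abdominal fibromatosis">
  <variant base="abdominal desmoid" />
  <variant base="mesenteric fibromatosis" />
</token>
"""

token = ET.fromstring(ENTRY.strip())

# Only the "base" strings become lookup keys; the canonical form is just a value.
index = {v.get("base"): token.get("canonical") for v in token.findall("variant")}


def lookup(text):
    """Return the canonical form for an exact variant match, or None."""
    return index.get(text)


print(lookup("abdominal desmoid"))
print(lookup("abdominal fibromatosis"))
```

The second lookup finds nothing, because "abdominal fibromatosis" appears only as the canonical form and not as a variant.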


I hope this helps!


On Jun 18, 2008, at 10:06 AM, Ahmed Abdeen Hamed wrote:

Thanks, Michael! I only recently joined the list, so I missed the early posting. I like this example a lot. I was able to get it to run using the document analyzer from the uimaj-example. I have some questions though: Is testDict.xml just an arbitrary XML file, meaning any well-formed XML file should work? How do I get my own XML dictionary files to work without transforming them into the XML format in your testDict.xml file? Is there documentation for this so that I can understand it on my own without bugging the entire list?

Thanks!
Ahmed

On Tue, Jun 17, 2008 at 8:05 PM, Michael Tanenblatt <[EMAIL PROTECTED]> wrote:

As Thilo mentioned in an email from May 19, 2008, I forgot to include the source for uima.tt.TokenAnnotation, but otherwise the code should be fine. Additionally, the problem you are seeing is with OffsetTokenizer, which is just a sample tokenizer--if you have another, more robust tokenizer, you don't need this OffsetTokenizer.
