My original message on this topic says a little about the dictionary format. I am in the process of writing a paper describing the whole of ConceptMapper, but that is not yet done. Here is what I wrote before:

The structure of the dictionary itself is quite flexible. Entries can have any number of variants (synonyms), and arbitrary features can be associated with dictionary entries. Individual variants inherit features from the parent token (i.e., the canonical form), but can override them or add additional features. In the following sample dictionary entry, there are five variants of the canonical form, and as described earlier, each inherits the SemClass and POS attributes from the canonical form, with the exception of the variant "mesenteric fibromatosis (c48.1)", which overrides the value of the SemClass attribute (this is a somewhat contrived example, just to make that point):
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
  <variant base="abdominal fibromatosis" />
  <variant base="abdominal desmoid" />
  <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
  <variant base="mesenteric fibromatosis" />
  <variant base="retroperitoneal fibromatosis" />
</token>
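
To make the inherit/override behavior concrete, here is a minimal Python sketch that resolves the effective attributes of each variant in the entry above. It is only an illustration of the rule as described in this message, not ConceptMapper's own loading code; the function name resolve_variants is hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample dictionary entry from this message.
ENTRY = """\
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
  <variant base="abdominal fibromatosis" />
  <variant base="abdominal desmoid" />
  <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
  <variant base="mesenteric fibromatosis" />
  <variant base="retroperitoneal fibromatosis" />
</token>
"""


def resolve_variants(entry_xml):
    """Return {variant base string: effective attributes}.

    Hypothetical helper: each variant starts with a copy of the token's
    attributes and then applies its own, so a variant can override a
    value or add a new one.
    """
    token = ET.fromstring(entry_xml)
    resolved = {}
    for variant in token.findall("variant"):
        features = dict(token.attrib)     # inherit from the parent token
        features.update(variant.attrib)   # variant values override or extend
        base = features.pop("base")       # the string matched in the text
        resolved[base] = features
    return resolved


variants = resolve_variants(ENTRY)
for base, features in variants.items():
    print(base, "->", features["SemClass"], features["POS"])
```

Only "mesenteric fibromatosis (c48.1)" ends up with SemClass "Diagnosis-Site"; the other four variants keep the token's "Diagnosis", and all five keep POS "NN".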

So, testDict.xml is just an example. Two key AE descriptor parameters are "AttributeList" and "FeatureList", which provide the means to map from the XML attributes to the target annotation features. If your target annotation were called "DictTerm" and the DictTerm had the features "canonicalForm", "semanticClass" and "partOfSpeechTag", using the example dictionary snippet shown above, you would set AttributeList to:

        DictCanon
        SemClass
        POS

and you would set FeatureList to:

        canonicalForm
        semanticClass
        partOfSpeechTag

Then, when one of the variants is matched in the text, a new DictTerm would be created with its semanticClass set to the value of the SemClass attribute and its partOfSpeechTag set to the value of the POS attribute.
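
The pairing between the two lists can be sketched in a few lines of Python. This assumes the lists are matched by position, which is how the description above reads; it is not taken from the ConceptMapper source, and the function name build_annotation_features is hypothetical. Only the two pairs spelled out above (SemClass and POS) are shown.

```python
# Parallel lists, as they might be set in the AE descriptor
# (only the two pairs described explicitly in the text are shown).
ATTRIBUTE_LIST = ["SemClass", "POS"]
FEATURE_LIST = ["semanticClass", "partOfSpeechTag"]


def build_annotation_features(matched_attributes, attribute_list, feature_list):
    """Map XML attribute values onto annotation feature names by position.

    Hypothetical helper: entry i of attribute_list is assumed to pair
    with entry i of feature_list.
    """
    if len(attribute_list) != len(feature_list):
        raise ValueError("AttributeList and FeatureList must have the same length")
    return {
        feature: matched_attributes[attribute]
        for attribute, feature in zip(attribute_list, feature_list)
        if attribute in matched_attributes  # skip attributes the entry lacks
    }


# Effective attributes of the matched variant "mesenteric fibromatosis (c48.1)".
matched = {"SemClass": "Diagnosis-Site", "POS": "NN"}
dict_term = build_annotation_features(matched, ATTRIBUTE_LIST, FEATURE_LIST)
print(dict_term)
```

Under that assumption, the order of the two lists matters: swapping two entries in one list without swapping them in the other would put values into the wrong features.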

One important point: matches are only performed against the strings given as values of the "variant" element's "base" attribute. It is common practice for the "token" element to carry a canonical form that is the same as one of the variants, but that is not required.

I hope this helps!


On Jun 18, 2008, at 10:06 AM, Ahmed Abdeen Hamed wrote:

Thanks, Michael! I only recently joined the list, so I missed the earlier posting. I like this example a lot. I was able to get it to run using the document analyzer from the uimaj-example. I have some questions, though: Is testDict.xml just an arbitrary XML file, meaning any well-formed XML file should work? How do I get my own XML dictionary files to work without transforming them into the XML format in your testDict.xml file? Is there documentation for this so that I can understand it on my own without bugging the entire list? Thanks!
Ahmed

On Tue, Jun 17, 2008 at 8:05 PM, Michael Tanenblatt <[EMAIL PROTECTED]> wrote:

As Thilo mentioned in an email from May 19, 2008, I forgot to include the source for uima.tt.TokenAnnotation, but otherwise the code should be fine.

Additionally, the problem you are seeing is with OffsetTokenizer, which is just a sample tokenizer. If you have another, more robust tokenizer, you don't need OffsetTokenizer.


