There is some in-depth discussion about this in the UIMA User mailing list--check the archives. The subject line was "Any interest in this as an open source project?", and it was from May 2008 or possibly started at the end of April.

On Jun 18, 2008, at 12:33 PM, Ahmed Abdeen Hamed wrote:

Thanks for the response. I am still not sure about some aspects of it. I
just found out that the UIMA framework has this following
DictionaryAnnotator feature:
http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/DictionaryAnnotator/doc/pdf/DictionaryAnnotatorUserGuide.pdf

This is similar to what the ConceptMapper is doing. Does ConceptMapper have
any advantage over the DictionaryAnnotator?

Thank you!
Ahmed

On Wed, Jun 18, 2008 at 10:23 AM, Michael Tanenblatt <[EMAIL PROTECTED]> wrote:

My original message regarding this talks some about the dictionary format.
I am in the process of writing a paper describing the whole of
ConceptMapper, but that is not yet done. Here is what I wrote before:

The structure of the dictionary itself is quite flexible. Entries can have
any number of variants (synonyms), and arbitrary features can be associated
with dictionary entries. Individual variants inherit features from the
parent token (i.e., the canonical form), but can override them or add
additional features. In the following sample dictionary entry, there are 5
variants of the canonical form, and as described earlier, each inherits the
SemClass and POS attributes from the canonical form, with the exception of
the variant "mesenteric fibromatosis (c48.1)", which overrides the value of
the SemClass attribute (this is somewhat of a contrived example, just to
make that point):
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
  <variant base="abdominal fibromatosis" />
  <variant base="abdominal desmoid" />
  <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
  <variant base="mesenteric fibromatosis" />
  <variant base="retroperitoneal fibromatosis" />
</token>
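
To make the inheritance rule concrete, here is a minimal Python sketch that
parses the sample entry and computes the effective attributes of each
variant. It is only an illustration of the rule described above, not
ConceptMapper's own code (ConceptMapper is a Java UIMA component); the
function name resolve_variants is hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample dictionary entry from the message above.
ENTRY = """
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
  <variant base="abdominal fibromatosis" />
  <variant base="abdominal desmoid" />
  <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
  <variant base="mesenteric fibromatosis" />
  <variant base="retroperitoneal fibromatosis" />
</token>
"""


def resolve_variants(token_xml):
    """Return {variant base string: effective attribute dict}.

    Hypothetical helper: each variant starts from a copy of the parent
    token's attributes, then its own attributes override or extend them.
    """
    token = ET.fromstring(token_xml)
    resolved = {}
    for variant in token.findall("variant"):
        attrs = dict(token.attrib)    # inherited from the parent token
        attrs.update(variant.attrib)  # variant-level overrides and additions
        base = attrs.pop("base")      # "base" is the match string, not a feature
        resolved[base] = attrs
    return resolved


variants = resolve_variants(ENTRY)
for base, attrs in variants.items():
    print(base, "->", attrs)
```

Only the "(c48.1)" variant ends up with SemClass "Diagnosis-Site"; the other
four keep the "Diagnosis" value inherited from the token.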


So, testDict.xml is just an example. Two key AE descriptor parameters are
"AttributeList" and "FeatureList", which provide the means to map from the
XML attributes to the target annotation features. If your target annotation
were called "DictTerm" and the DictTerm had the features "canonicalForm",
"semanticClass" and "partOfSpeechTag", using the example dictionary snippet
shown above, you would set AttributeList to:

      DictCanon
      SemClass
      POS

and you would set FeatureList to:

      canonicalForm
      semanticClass
      partOfSpeechTag

then, when one of the variants is matched in the text, a new DictTerm would be created with its semanticClass set to the value of the SemClass attribute
and its partOfSpeechTag set to the value of the POS attribute.
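
The pairing of the two parameters can be sketched in Python as follows. This
assumes the lists are paired by position, which is what the example above
suggests; the function name and the plain-dict stand-in for a DictTerm
annotation are hypothetical. Attributes that the matched entry does not
carry are simply skipped, which is an assumption of this sketch rather than
documented ConceptMapper behavior.

```python
# Parameter values as given in the message above.
ATTRIBUTE_LIST = ["DictCanon", "SemClass", "POS"]
FEATURE_LIST = ["canonicalForm", "semanticClass", "partOfSpeechTag"]


def map_attributes_to_features(entry_attrs, attribute_list, feature_list):
    """Copy dictionary attribute values onto annotation feature names.

    Hypothetical helper: the i-th attribute name is paired with the i-th
    feature name. Attributes missing from the entry are skipped.
    """
    if len(attribute_list) != len(feature_list):
        raise ValueError("AttributeList and FeatureList must have the same length")
    features = {}
    for attr_name, feature_name in zip(attribute_list, feature_list):
        if attr_name in entry_attrs:
            features[feature_name] = entry_attrs[attr_name]
    return features


# Effective attributes of the matched variant "mesenteric fibromatosis (c48.1)".
matched = {"SemClass": "Diagnosis-Site", "POS": "NN"}
dict_term = map_attributes_to_features(matched, ATTRIBUTE_LIST, FEATURE_LIST)
print(dict_term)
```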

One important point: matching is performed only against the strings given in
the "base" attribute of each "variant" element. It is common practice for
the "token" element to carry a canonical form that is the same as one of
the variants, but that is not required.
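
A small Python sketch of what this means in practice: in the hypothetical
entry below the canonical form is not repeated as a variant, so text
containing only the canonical form produces no match. The lookup is a naive
case-insensitive substring search chosen for brevity; it is not
ConceptMapper's actual token-based matching, and find_matches is a
hypothetical name.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: the canonical form is NOT listed as a variant.
ENTRY = """
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
  <variant base="abdominal desmoid" />
  <variant base="mesenteric fibromatosis" />
</token>
"""


def find_matches(text, token_xml):
    """Return (start, end, base) for the first occurrence of each variant.

    Only the "base" strings of the variants are searched for; the token's
    "canonical" attribute is never used as a match string.
    """
    token = ET.fromstring(token_xml)
    lowered = text.lower()
    hits = []
    for variant in token.findall("variant"):
        base = variant.get("base")
        start = lowered.find(base.lower())
        if start != -1:
            hits.append((start, start + len(base), base))
    return hits


variant_hits = find_matches("Biopsy confirmed an abdominal desmoid.", ENTRY)
canonical_hits = find_matches("History of abdominal fibromatosis.", ENTRY)
print(variant_hits)
print(canonical_hits)
```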

I hope this helps!



On Jun 18, 2008, at 10:06 AM, Ahmed Abdeen Hamed wrote:

Thanks, Michael! I only recently joined the list, so I missed the earlier
posting. I like this example a lot. I was able to get it to run using the
document analyzer from the uimaj-example. I have some questions though: Is
testDict.xml just an arbitrary XML file, meaning that any well-formed XML
file should work? How do I get my own XML dictionary files to work without
transforming them into the format used in your testDict.xml file? Is there
documentation for this so that I can understand it on my own without
bugging the entire list? Thanks!
Ahmed

On Tue, Jun 17, 2008 at 8:05 PM, Michael Tanenblatt <[EMAIL PROTECTED]> wrote:

As Thilo mentioned in an email from May 19, 2008, I forgot to include the
source for uima.tt.TokenAnnotation, but otherwise the code should be fine.

Additionally, the problem you are seeing is with OffsetTokenizer, which is
just a sample tokenizer--if you have another, more robust tokenizer, you
don't need this OffsetTokenizer.




