Re: [jira] Commented: (UIMA-1033) ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component

Michael Tanenblatt Wed, 18 Jun 2008 09:42:20 -0700

There is some in-depth discussion about this in the UIMA User mailing
list; check the archives. The subject line was "Any interest in this as
an open source project?", and it was from May 2008 or possibly started
at the end of April.


On Jun 18, 2008, at 12:33 PM, Ahmed Abdeen Hamed wrote:

Thanks for the response. I am still not sure about some aspects of it. I
just found out that the UIMA framework has the following
DictionaryAnnotator feature:
http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/DictionaryAnnotator/doc/pdf/DictionaryAnnotatorUserGuide.pdf
This is similar to what the ConceptMapper is doing. Does ConceptMapper
have any advantage over the DictionaryAnnotator?

Thank you!
Ahmed

On Wed, Jun 18, 2008 at 10:23 AM, Michael Tanenblatt <
[EMAIL PROTECTED]> wrote:
My original message regarding this talks some about the dictionary
format. I am in the process of writing a paper describing the whole of
ConceptMapper, but that is not yet done. Here is what I wrote before:

The structure of the dictionary itself is quite flexible. Entries can
have any number of variants (synonyms), and arbitrary features can be
associated with dictionary entries. Individual variants inherit features
from the parent token (i.e., the canonical form), but can override them
or add additional features. In the following sample dictionary entry,
there are 5 variants of the canonical form, and as described earlier,
each inherits the SemClass and POS attributes from the canonical form,
with the exception of the variant "mesenteric fibromatosis (c48.1)",
which overrides the value of the SemClass attribute (this is somewhat of
a contrived example, just to make that point):

<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
  <variant base="abdominal fibromatosis" />
  <variant base="abdominal desmoid" />
  <variant base="mesenteric fibromatosis (c48.1)"
           SemClass="Diagnosis-Site" />
  <variant base="mesenteric fibromatosis" />
  <variant base="retroperitoneal fibromatosis" />
</token>
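
To make the inheritance rule concrete, here is a minimal Python sketch of
how the effective attributes of each variant could be computed from an
entry like the one above. This is not ConceptMapper's own code; the
function name effective_attributes and the merge-by-dictionary approach
are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# The sample dictionary entry from this message.
ENTRY = """
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
  <variant base="abdominal fibromatosis" />
  <variant base="abdominal desmoid" />
  <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
  <variant base="mesenteric fibromatosis" />
  <variant base="retroperitoneal fibromatosis" />
</token>
"""


def effective_attributes(token_xml):
    """Return {variant base string: merged attribute dict} for one <token>.

    Hypothetical helper: illustrates the inherit/override rule only.
    """
    token = ET.fromstring(token_xml)
    merged = {}
    for variant in token.findall("variant"):
        attrs = dict(token.attrib)    # start from the parent token's attributes
        attrs.update(variant.attrib)  # variant attributes override or add to them
        base = attrs.pop("base")      # "base" is the string to match, not a feature
        merged[base] = attrs
    return merged


if __name__ == "__main__":
    for base, attrs in effective_attributes(ENTRY).items():
        print(base, "->", attrs["SemClass"], attrs["POS"])
```

Only "mesenteric fibromatosis (c48.1)" ends up with SemClass set to
"Diagnosis-Site"; the other four variants keep the inherited "Diagnosis".
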

So, testDict.xml is just an example. Two key AE descriptor parameters
are "AttributeList" and "FeatureList", which provide the means to map
from the XML attributes to the target annotation features. If your
target annotation were called "DictTerm" and the DictTerm had the
features "canonicalForm", "semanticClass" and "partOfSpeechTag", using
the example dictionary snippet shown above, you would set AttributeList
to:

      DictCanon
      SemClass
      POS

and you would set FeatureList to:

      canonicalForm
      semanticClass
      partOfSpeechTag

then, when one of the variants is matched in the text, a new DictTerm
would be created with its semanticClass set to the value of the SemClass
attribute and its partOfSpeechTag set to the value of the POS attribute.
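
The positional pairing of the two lists can be sketched in Python as
follows. This only illustrates the idea and is not how ConceptMapper is
implemented; the function name map_attributes_to_features is
hypothetical, and skipping attribute names that are absent from the
matched entry is an assumption of this sketch. How the value for
"DictCanon" is supplied is not covered here.

```python
# Parameter values as given in this message.
ATTRIBUTE_LIST = ["DictCanon", "SemClass", "POS"]
FEATURE_LIST = ["canonicalForm", "semanticClass", "partOfSpeechTag"]


def map_attributes_to_features(attributes, attribute_list, feature_list):
    """Pair the i-th attribute name with the i-th feature name.

    `attributes` holds the effective XML attributes of the matched variant.
    Assumption: names missing from `attributes` are simply skipped.
    """
    if len(attribute_list) != len(feature_list):
        raise ValueError("AttributeList and FeatureList must have equal length")
    features = {}
    for attr_name, feature_name in zip(attribute_list, feature_list):
        if attr_name in attributes:
            features[feature_name] = attributes[attr_name]
    return features


if __name__ == "__main__":
    # Effective attributes of the variant "mesenteric fibromatosis (c48.1)".
    matched = {"SemClass": "Diagnosis-Site", "POS": "NN"}
    print(map_attributes_to_features(matched, ATTRIBUTE_LIST, FEATURE_LIST))
```
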

One important point: matches are only performed against the strings
given in the "base" attribute of the "variant" elements. It is common
practice for the "token" element to have a canonical form that is the
same as one of the variants, but that is not required.
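
As a small illustration of that point, the Python sketch below builds a
lookup table from an entry whose canonical form is deliberately not
repeated as a variant. It uses exact whole-string lookup, which is a
simplification and not ConceptMapper's token-based matching; the name
build_lookup is hypothetical.

```python
import xml.etree.ElementTree as ET

# Note: the canonical form is NOT listed as a variant in this entry.
ENTRY = """
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
  <variant base="abdominal desmoid" />
  <variant base="mesenteric fibromatosis" />
</token>
"""


def build_lookup(token_xml):
    """Map each variant's "base" string to the entry's canonical form."""
    token = ET.fromstring(token_xml)
    canonical = token.get("canonical")
    return {v.get("base"): canonical for v in token.findall("variant")}


if __name__ == "__main__":
    lookup = build_lookup(ENTRY)
    for text in ("abdominal desmoid", "abdominal fibromatosis"):
        print(text, "->", lookup.get(text, "no match"))
```

In this sketch "abdominal desmoid" resolves to the canonical form, while
"abdominal fibromatosis" itself finds nothing because no variant lists it.
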
I hope this helps!



On Jun 18, 2008, at 10:06 AM, Ahmed Abdeen Hamed wrote:

Thanks, Michael! I only recently joined the list so I missed the early
posting. I like this example a lot. I was able to get it to run using
the document analyzer from the uimaj-example. I have some questions
though: Is the testDict.xml just an arbitrary XML file, meaning that any
well-formed XML file should work? How do I get my own XML dictionary
files to work without transforming them into the XML format in your
testDict.xml file? Is there documentation for this so that I can
understand it on my own without bugging the entire list?

Thanks!
Ahmed

On Tue, Jun 17, 2008 at 8:05 PM, Michael Tanenblatt <
[EMAIL PROTECTED]>
wrote:
As Thilo mentioned in an email from May 19, 2008, I forgot to include
the source for uima.tt.TokenAnnotation, but otherwise the code should be
fine.

Additionally, the problem you are seeing is with OffsetTokenizer, which
is just a sample tokenizer; if you have another, more robust tokenizer,
you don't need this OffsetTokenizer.
