My original message on this subject says a little about the dictionary
format. I am in the process of writing a paper describing the whole of
ConceptMapper, but that is not yet done. Here is what I wrote before:
The structure of the dictionary itself is quite flexible. Entries can
have any number of variants (synonyms), and arbitrary features can be
associated with dictionary entries. Individual variants inherit features
from the parent token (i.e., the canonical form), but can override them
or add additional features. In the following sample dictionary entry,
there are 5 variants of the canonical form, and as described earlier,
each inherits the SemClass and POS attributes from the canonical form,
with the exception of the variant "mesenteric fibromatosis (c48.1)",
which overrides the value of the SemClass attribute (this is a somewhat
contrived example, just to make that point):
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
    <variant base="abdominal fibromatosis" />
    <variant base="abdominal desmoid" />
    <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
    <variant base="mesenteric fibromatosis" />
    <variant base="retroperitoneal fibromatosis" />
</token>
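To make the inheritance rule concrete, here is a minimal Python sketch
that parses the sample entry and works out the effective attributes of
each variant. It is not ConceptMapper code: the function name
resolve_variants and the constant SAMPLE_ENTRY are made up for this
illustration, and the only behavior assumed is what is described above
(a variant starts with the token's attributes and may override or add
to them).

```python
import xml.etree.ElementTree as ET

# The sample dictionary entry from the message above.
SAMPLE_ENTRY = """
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
    <variant base="abdominal fibromatosis" />
    <variant base="abdominal desmoid" />
    <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
    <variant base="mesenteric fibromatosis" />
    <variant base="retroperitoneal fibromatosis" />
</token>
"""


def resolve_variants(token_xml):
    """Return {variant base string: effective attributes} for one entry.

    Hypothetical helper: each variant inherits every attribute of the
    enclosing token, and its own attributes override the inherited ones.
    """
    token = ET.fromstring(token_xml.strip())
    inherited = dict(token.attrib)
    resolved = {}
    for variant in token.findall("variant"):
        attributes = dict(inherited)        # start from the token's attributes
        attributes.update(variant.attrib)   # variant values win on conflict
        base = attributes.pop("base")       # the matchable string is the key
        resolved[base] = attributes
    return resolved


RESOLVED = resolve_variants(SAMPLE_ENTRY)
for base, attributes in RESOLVED.items():
    print(base, "->", attributes["SemClass"], attributes["POS"])
```

Only "mesenteric fibromatosis (c48.1)" ends up with a SemClass other
than the one on the token; the other four variants keep the inherited
value.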
So, testDict.xml is just an example. Two key AE descriptor parameters
are "AttributeList" and "FeatureList", which provide the means to map
from the XML attributes to the target annotation features. If your
target annotation were called "DictTerm" and the DictTerm had the
features "canonicalForm", "semanticClass" and "partOfSpeechTag", using
the example dictionary snippet shown above, you would set
AttributeList to:
DictCanon
SemClass
POS
and you would set FeatureList to:
canonicalForm
semanticClass
partOfSpeechTag
Then, when one of the variants is matched in the text, a new DictTerm
would be created with its semanticClass set to the value of the
SemClass attribute and its partOfSpeechTag set to the value of the POS
attribute.
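The pairing of the two lists can be sketched in a few lines of Python.
This is an illustration only, not the annotator's implementation: the
function name to_annotation_features is hypothetical, the lists are
assumed to be paired by position (as the example suggests), and the
sketch is limited to the SemClass and POS pairs whose outcome is
spelled out above.

```python
def to_annotation_features(entry_attributes, attribute_list, feature_list):
    """Map dictionary attribute values onto annotation feature names.

    Hypothetical helper. Assumes attribute_list[i] names the XML
    attribute whose value is copied into the feature feature_list[i].
    """
    if len(attribute_list) != len(feature_list):
        raise ValueError("AttributeList and FeatureList must have equal length")
    features = {}
    for attribute, feature in zip(attribute_list, feature_list):
        # Skip attributes the matched entry does not carry.
        if attribute in entry_attributes:
            features[feature] = entry_attributes[attribute]
    return features


# Effective attributes of a matched variant from the sample entry.
matched = {"SemClass": "Diagnosis", "POS": "NN"}

DICT_TERM = to_annotation_features(
    matched,
    attribute_list=["SemClass", "POS"],
    feature_list=["semanticClass", "partOfSpeechTag"],
)
print(DICT_TERM)
```

Keeping the two lists separate but parallel means the dictionary can use
whatever attribute names it likes, while the type system keeps its own
feature names.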
One important point: matches are only performed against the strings
given as the value of the "base" attribute of "variant" elements. It is
common practice for the "token" element to have a canonical form that
is the same as one of its variants, but that is not required.
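A small Python sketch of that point, using a made-up entry whose
canonical form is deliberately not repeated as a variant. The names
UNLISTED_CANONICAL_ENTRY and build_match_index are hypothetical, and
the exact-string lookup stands in for the real token-based matching,
which this sketch does not attempt to reproduce.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: the canonical form does NOT appear as a variant.
UNLISTED_CANONICAL_ENTRY = """
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
    <variant base="abdominal desmoid" />
    <variant base="mesenteric fibromatosis" />
</token>
"""


def build_match_index(token_xml):
    """Index only the variant "base" strings, each pointing at its canonical form."""
    token = ET.fromstring(token_xml.strip())
    canonical = token.get("canonical")
    return {variant.get("base"): canonical for variant in token.findall("variant")}


MATCH_INDEX = build_match_index(UNLISTED_CANONICAL_ENTRY)

# A variant string is matchable and leads back to the canonical form.
print(MATCH_INDEX.get("abdominal desmoid"))
# The canonical string itself is not matchable unless it is also a variant.
print(MATCH_INDEX.get("abdominal fibromatosis"))
```

If the canonical form should also be found in text, it has to be listed
as a variant, as in the sample entry above.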
I hope this helps!
On Jun 18, 2008, at 10:06 AM, Ahmed Abdeen Hamed wrote:
Thanks, Michael! I only recently joined the list, so I missed the
earlier posting. I like this example a lot. I was able to get it to run
using the document analyzer from the uimaj-example. I have some
questions though:
Is testDict.xml just an arbitrary XML file, meaning that any well-formed
XML file should work? How do I get my own XML dictionary files to work
without transforming them into the XML format in your testDict.xml file?
Is there documentation for this so that I can understand it on my own
without bugging the entire list? Thanks!
Ahmed
On Tue, Jun 17, 2008 at 8:05 PM, Michael Tanenblatt <[EMAIL PROTECTED]> wrote:
As Thilo mentioned in an email from May 19, 2008, I forgot to include
the source for uima.tt.TokenAnnotation, but otherwise the code should be
fine. Additionally, the problem you are seeing is with OffsetTokenizer,
which is just a sample tokenizer--if you have another, more robust
tokenizer, you don't need this OffsetTokenizer.