There is some in-depth discussion about this in the UIMA User mailing
list--check the archives. The subject line was "Any interest in this
as an open source project?", and it was from May 2008 or possibly
started at the end of April.
On Jun 18, 2008, at 12:33 PM, Ahmed Abdeen Hamed wrote:
Thanks for the response. I am still not sure about some aspects of it.
I just found out that the UIMA framework has the following
DictionaryAnnotator feature:
http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/DictionaryAnnotator/doc/pdf/DictionaryAnnotatorUserGuide.pdf
This is similar to what the ConceptMapper is doing. Is there any
advantage over the DictionaryAnnotator?
Thank you!
Ahmed
On Wed, Jun 18, 2008 at 10:23 AM, Michael Tanenblatt <[EMAIL PROTECTED]> wrote:
My original message regarding this talks some about the dictionary
format. I am in the process of writing a paper describing the whole of
ConceptMapper, but that is not yet done. Here is what I wrote before:
The structure of the dictionary itself is quite flexible. Entries can
have any number of variants (synonyms), and arbitrary features can be
associated with dictionary entries. Individual variants inherit features
from the parent token (i.e., the canonical form), but can override them
or add additional features. In the following sample dictionary entry,
there are 5 variants of the canonical form, and as described earlier,
each inherits the SemClass and POS attributes from the canonical form,
with the exception of the variant "mesenteric fibromatosis (c48.1)",
which overrides the value of the SemClass attribute (this is a somewhat
contrived example, just to make that point):
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
  <variant base="abdominal fibromatosis" />
  <variant base="abdominal desmoid" />
  <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
  <variant base="mesenteric fibromatosis" />
  <variant base="retroperitoneal fibromatosis" />
</token>
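To make the inheritance rule concrete, here is a minimal Python sketch
that parses the sample entry and computes the effective attributes of
each variant. It is only an illustration of the rule described above,
not ConceptMapper's actual code; the function name resolve_variants is
made up for this example.

```python
import xml.etree.ElementTree as ET

# The sample dictionary entry from the message above.
SAMPLE_ENTRY = """
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
  <variant base="abdominal fibromatosis" />
  <variant base="abdominal desmoid" />
  <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
  <variant base="mesenteric fibromatosis" />
  <variant base="retroperitoneal fibromatosis" />
</token>
"""


def resolve_variants(token_xml):
    """Return {variant base string: effective attribute dict}.

    Hypothetical helper: each variant starts with the parent token's
    attributes and then applies its own, so a variant can override an
    inherited value or add a new one.
    """
    token = ET.fromstring(token_xml)
    resolved = {}
    for variant in token.findall("variant"):
        attrs = dict(token.attrib)   # inherit from the parent token
        own = dict(variant.attrib)
        base = own.pop("base")       # the matchable string, not a feature
        attrs.update(own)            # variant attributes override or add
        resolved[base] = attrs
    return resolved


variants = resolve_variants(SAMPLE_ENTRY)
for base, attrs in variants.items():
    print(base, "->", attrs)
```

Only "mesenteric fibromatosis (c48.1)" ends up with SemClass set to
"Diagnosis-Site"; the other four variants keep the inherited
"Diagnosis".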
So, testDict.xml is just an example. Two key AE descriptor parameters
are "AttributeList" and "FeatureList", which provide the means to map
from the XML attributes to the target annotation features. If your
target annotation were called "DictTerm" and DictTerm had the features
"canonicalForm", "semanticClass" and "partOfSpeechTag", using the
example dictionary snippet shown above, you would set AttributeList to:
DictCanon
SemClass
POS
and you would set FeatureList to:
canonicalForm
semanticClass
partOfSpeechTag
Then, when one of the variants is matched in the text, a new DictTerm
would be created with its semanticClass set to the value of the SemClass
attribute and its partOfSpeechTag set to the value of the POS attribute.
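The pairing between the two lists is positional, which a short Python
sketch can show. This is an illustration under assumptions, not the
annotator's implementation: build_features is a made-up helper, and a
plain dict stands in for the DictTerm annotation. Note also that the
sample entry names its first attribute "canonical" while the list above
says "DictCanon"; the sketch assumes the names in AttributeList must
match the attribute names used in the dictionary, so it uses
"canonical".

```python
def build_features(attribute_list, feature_list, matched_attrs):
    """Map XML attribute values onto annotation feature names.

    Hypothetical helper: the i-th attribute name is paired with the
    i-th feature name. Attributes missing from the matched entry are
    skipped rather than set to a placeholder.
    """
    if len(attribute_list) != len(feature_list):
        raise ValueError("AttributeList and FeatureList must have the same length")
    features = {}
    for attr_name, feature_name in zip(attribute_list, feature_list):
        if attr_name in matched_attrs:
            features[feature_name] = matched_attrs[attr_name]
    return features


# Assumption: attribute names follow the sample entry ("canonical").
attribute_list = ["canonical", "SemClass", "POS"]
feature_list = ["canonicalForm", "semanticClass", "partOfSpeechTag"]

# Effective attributes of a matched variant of the sample entry.
matched_attrs = {
    "canonical": "abdominal fibromatosis",
    "SemClass": "Diagnosis",
    "POS": "NN",
}

dict_term = build_features(attribute_list, feature_list, matched_attrs)
print(dict_term)
```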
One important point: matches are only performed against the strings
given as values of the "variant" tag's "base" attribute. It is common
practice for the "token" element to have a canonical form that is the
same as one of the variants, but that is not required.
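A small Python sketch of that point, under stated assumptions: the entry
below is a hypothetical one whose canonical form is deliberately not
repeated as a variant, and the matching is a simplified
case-insensitive substring test, which is not how ConceptMapper
actually performs lookup.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: the canonical form is NOT listed as a variant.
ENTRY = """
<token canonical="fibromatosis, abdominal" SemClass="Diagnosis" POS="NN">
  <variant base="abdominal fibromatosis" />
  <variant base="abdominal desmoid" />
</token>
"""


def matchable_strings(token_xml):
    """Collect only the "base" values of the variants.

    The canonical attribute on the token is ignored on purpose.
    """
    token = ET.fromstring(token_xml)
    return {variant.get("base") for variant in token.findall("variant")}


def find_matches(text, bases):
    """Simplified stand-in for dictionary lookup (substring test)."""
    lowered = text.lower()
    return sorted(base for base in bases if base.lower() in lowered)


bases = matchable_strings(ENTRY)
print(find_matches("History of abdominal desmoid.", bases))
print(find_matches("Coded as fibromatosis, abdominal.", bases))
```

The second call finds nothing, because the canonical form appears only
on the "token" element and never as a variant's "base".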
I hope this helps!
On Jun 18, 2008, at 10:06 AM, Ahmed Abdeen Hamed wrote:
Thanks, Michael! I only recently joined the list, so I missed the
earlier posting. I like this example a lot. I was able to get it to run
using the document analyzer from the uimaj-example. I have some
questions though:
Is testDict.xml just an arbitrary XML file, meaning that any
well-formed XML file should work? How do I get my own XML dictionary
files to work without transforming them into the XML format of your
testDict.xml file? Is there documentation for this so that I can
understand it on my own without bugging the entire list? Thanks!
Ahmed
On Tue, Jun 17, 2008 at 8:05 PM, Michael Tanenblatt <[EMAIL PROTECTED]> wrote:
As Thilo mentioned in an email from May 19, 2008, I forgot to include
the source for uima.tt.TokenAnnotation, but otherwise the code should be
fine. Additionally, the problem you are seeing is with OffsetTokenizer,
which is just a sample tokenizer. If you have another, more robust
tokenizer, you don't need this OffsetTokenizer.