[ https://issues.apache.org/jira/browse/UIMA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Baessler updated UIMA-1033:
-----------------------------------

    Component/s:     (was: Sandbox)
                 Sandbox-ConceptMapper

> ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component
> ----------------------------------------------------------------------------------
>
>                 Key: UIMA-1033
>                 URL: https://issues.apache.org/jira/browse/UIMA-1033
>             Project: UIMA
>          Issue Type: New Feature
>          Components: Sandbox-ConceptMapper
>         Environment: Java 5
>            Reporter: Michael Tanenblatt
>            Assignee: Michael Baessler
>            Priority: Minor
>             Fix For: 2.3S
>
>         Attachments: conceptMapper.zip, conceptMapper.zip.md5
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> ConceptMapper is a token-based dictionary lookup UIMA component. It was
> designed specifically to allow any external tokenizer that is a UIMA
> component to be used to tokenize its dictionary. Using the same tokenizer
> for both the dictionary and subsequent text processing prevents situations
> where a particular dictionary entry is not found, though it exists, because
> it was tokenized differently than the text being processed.
> ConceptMapper is highly configurable, in terms of:
>  * the way dictionary entries are mapped to resultant annotations
>  * the way input documents are processed
>  * the availability of multiple lookup strategies
>  * its various output options.
> Additionally, a set of post-processing filters is supplied, as well as an
> interface to easily create new filters. This allows for overgenerating
> results during the lookup phase, if so desired, then reducing the result
> set according to particular rules.
> More details:
> The structure of the dictionary itself is quite flexible. Entries can have
> any number of variants (synonyms), and arbitrary features can be associated
> with dictionary entries. Individual variants inherit features from their
> parent token (i.e., the canonical form), but can override them or add
> additional features.
> In the following sample dictionary entry, there are 5 variants of the
> canonical form, and as described earlier, each inherits the SemClass and
> POS attributes from the canonical form, with the exception of the variant
> "mesenteric fibromatosis (c48.1)", which overrides the value of the
> SemClass attribute (this is a somewhat contrived example, just to make
> that point):
> <token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
>    <variant base="abdominal fibromatosis" />
>    <variant base="abdominal desmoid" />
>    <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
>    <variant base="mesenteric fibromatosis" />
>    <variant base="retroperitoneal fibromatosis" />
> </token>
> Input tokens are processed one span at a time, where both the token and
> span (usually a sentence) annotation types are configurable. Additionally,
> the particular feature of the token annotation to use for lookups can be
> specified; otherwise, its covered text is used. Other input configuration
> settings are whether to use case-sensitive matching, an optional class name
> of a stemmer to apply to the tokens, and a list of stop words to ignore
> during lookup. One additional input control mechanism is the ability to
> skip tokens during lookups based on particular feature values. In this way,
> it is easy to skip, for example, all tokens with particular part-of-speech
> tags, or with some previously computed semantic class.
> Output is in the form of new annotations, and the type of the resulting
> annotations can be specified in a descriptor file. The mapping from
> dictionary entry attributes to the result annotation features can also be
> specified. Additionally, a string containing the matched text, a list of
> matched tokens, and the span enclosing the match can be specified to be set
> in the result annotations. It is also possible to indicate dictionary
> attributes to write back into each of the matched tokens.
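To illustrate the inheritance rule from the sample dictionary entry above, here is a minimal Java sketch. It is not ConceptMapper's own dictionary loader: the class and method names (VariantInheritanceSketch, resolve, attributesOf) are assumptions made for this example, and it uses only the JDK's DOM parser. Each variant starts with the attributes of its parent token, and the variant's own attributes then override or extend them.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;

// Illustrative only: hypothetical names, not part of ConceptMapper.
class VariantInheritanceSketch {

    // The sample entry from the issue description.
    static final String SAMPLE =
        "<token canonical=\"abdominal fibromatosis\" SemClass=\"Diagnosis\" POS=\"NN\">"
      + "<variant base=\"abdominal fibromatosis\" />"
      + "<variant base=\"abdominal desmoid\" />"
      + "<variant base=\"mesenteric fibromatosis (c48.1)\" SemClass=\"Diagnosis-Site\" />"
      + "<variant base=\"mesenteric fibromatosis\" />"
      + "<variant base=\"retroperitoneal fibromatosis\" />"
      + "</token>";

    // Copy every XML attribute of an element into a map.
    static Map<String, String> attributesOf(Element e) {
        Map<String, String> out = new LinkedHashMap<>();
        NamedNodeMap attrs = e.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            out.put(attrs.item(i).getNodeName(), attrs.item(i).getNodeValue());
        }
        return out;
    }

    // Map each variant's "base" text to its effective features:
    // the parent token's attributes, overridden by the variant's own.
    static Map<String, Map<String, String>> resolve(String xml) {
        try {
            Element token = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
            Map<String, String> inherited = attributesOf(token);
            Map<String, Map<String, String>> result = new LinkedHashMap<>();
            NodeList variants = token.getElementsByTagName("variant");
            for (int i = 0; i < variants.getLength(); i++) {
                Element variant = (Element) variants.item(i);
                Map<String, String> features = new LinkedHashMap<>(inherited);
                features.putAll(attributesOf(variant)); // variant values win
                String base = features.remove("base");  // lookup key, not a feature
                result.put(base, features);
            }
            return result;
        } catch (Exception ex) {
            throw new RuntimeException("could not parse dictionary entry", ex);
        }
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Map<String, String>> e : resolve(SAMPLE).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

In this sketch only "mesenteric fibromatosis (c48.1)" ends up with SemClass "Diagnosis-Site"; the other four variants keep the inherited "Diagnosis" value, and all five keep POS "NN".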
> Dictionary lookup is controlled by three parameters in the descriptor. The
> first allows for order-independent lookup (i.e., A B == B A). The second
> toggles between finding only the longest match and finding all possible
> matches. The third specifies the search strategy, of which there are three.
> The default search strategy considers only contiguous tokens (not including
> tokens from the stop word list or otherwise skipped tokens), and then
> begins the subsequent search after the longest match. The second strategy
> allows for ignoring non-matching tokens, allowing for disjoint matches, so
> that a dictionary entry of
>     A C
> would match against the text
>     A B C
> As with the default search strategy, the subsequent search begins after the
> longest match. The final search strategy is identical to the second, except
> that subsequent searches begin one token ahead, instead of after the
> previous match. This enables overlapping matches.
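The difference between the three search strategies described above can be sketched in a few lines of Java. This is a simplified illustration under stated assumptions, not ConceptMapper's implementation: the class name, the Strategy enum constants, and the find and matchEnd methods are all hypothetical. It ignores stop words, stemming, and order-independent lookup, and it treats the match that ends furthest to the right as the "longest" one. Matches are reported as "begin-end" token index ranges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: hypothetical names, not ConceptMapper's parameter values.
class SearchStrategySketch {

    enum Strategy { CONTIGUOUS, SKIP_NON_MATCHING, SKIP_NON_MATCHING_OVERLAP }

    // Index of the last text token consumed by the entry when matching
    // starts at 'start', or -1 if the entry does not match there.
    static int matchEnd(List<String> text, int start, List<String> entry, boolean allowGaps) {
        if (entry.isEmpty()) return -1;
        int pos = start;
        for (int k = 0; k < entry.size(); k++) {
            if (allowGaps && k > 0) {
                // Skip text tokens that do not match the next entry token.
                while (pos < text.size() && !text.get(pos).equals(entry.get(k))) pos++;
            }
            if (pos >= text.size() || !text.get(pos).equals(entry.get(k))) return -1;
            pos++;
        }
        return pos - 1;
    }

    static List<String> find(List<String> text, List<List<String>> dictionary, Strategy strategy) {
        boolean allowGaps = strategy != Strategy.CONTIGUOUS;
        List<String> matches = new ArrayList<>();
        int i = 0;
        while (i < text.size()) {
            int bestEnd = -1;
            for (List<String> entry : dictionary) {
                bestEnd = Math.max(bestEnd, matchEnd(text, i, entry, allowGaps));
            }
            if (bestEnd >= 0) matches.add(i + "-" + bestEnd);
            if (bestEnd >= 0 && strategy != Strategy.SKIP_NON_MATCHING_OVERLAP) {
                i = bestEnd + 1; // resume after the match
            } else {
                i++;             // resume one token ahead (permits overlaps)
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("A", "B", "C", "D");
        List<List<String>> dict = Arrays.asList(
            Arrays.asList("A", "C"), Arrays.asList("B", "D"));
        System.out.println(find(text, dict, Strategy.CONTIGUOUS));                // []
        System.out.println(find(text, dict, Strategy.SKIP_NON_MATCHING));         // [0-2]
        System.out.println(find(text, dict, Strategy.SKIP_NON_MATCHING_OVERLAP)); // [0-2, 1-3]
    }
}
```

With the text "A B C D" and the entries "A C" and "B D", the contiguous strategy finds nothing, the second strategy finds "A C" and then resumes after it (so "B D" is missed), and the third strategy resumes one token ahead and therefore also finds the overlapping "B D".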