[jira] Commented: (UIMA-1033) ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component
[ https://issues.apache.org/jira/browse/UIMA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597982#action_12597982 ]

Thilo Goetz commented on UIMA-1033:
-----------------------------------

Source code looks mostly fine to me, but doesn't compile. Source code for uima.tt.TokenAnnotation is missing. Not a big deal, though.

So what are your plans for this code once it's at Apache? Will you continue to support and develop this code?

ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component
----------------------------------------------------------------------------------

Key: UIMA-1033
URL: https://issues.apache.org/jira/browse/UIMA-1033
Project: UIMA
Issue Type: New Feature
Components: Sandbox
Environment: Java 5
Reporter: Michael Tanenblatt
Priority: Minor
Attachments: conceptMapper.zip, conceptMapper.zip.md5
Original Estimate: 24h
Remaining Estimate: 24h

ConceptMapper is a token-based dictionary lookup UIMA component. It was designed specifically to allow any external tokenizer that is a UIMA component to be used to tokenize its dictionary. Using the same tokenizer on both the dictionary and for subsequent text processing prevents situations where a particular dictionary entry is not found, though it exists, because it was tokenized differently than the text being processed.

ConceptMapper is highly configurable, in terms of:
* the way dictionary entries are mapped to resultant annotations
* the way input documents are processed
* the availability of multiple lookup strategies
* its various output options.

Additionally, a set of post-processing filters is supplied, as well as an interface to easily create new filters. This allows for overgenerating results during the lookup phase, if so desired, then reducing the result set according to particular rules.

More details:

The structure of the dictionary itself is quite flexible. Entries can have any number of variants (synonyms), and arbitrary features can be associated with dictionary entries. Individual variants inherit features from the parent token (i.e., the canonical form), but can override them or add additional features. In the following sample dictionary entry, there are 5 variants of the canonical form, and as described earlier, each inherits the SemClass and POS attributes from the canonical form, with the exception of the variant "mesenteric fibromatosis (c48.1)", which overrides the value of the SemClass attribute (this is somewhat of a contrived example, just to make that point):

    <token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
        <variant base="abdominal fibromatosis" />
        <variant base="abdominal desmoid" />
        <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
        <variant base="mesenteric fibromatosis" />
        <variant base="retroperitoneal fibromatosis" />
    </token>

Input tokens are processed one span at a time, where both the token and span (usually a sentence) annotation types are configurable. Additionally, the particular feature of the token annotation to use for lookups can be specified; otherwise, its covered text is used. Other input configuration settings are whether to use case-sensitive matching, an optional class name of a stemmer to apply to the tokens, and a list of stop words to ignore during lookup. One additional input control mechanism is the ability to skip tokens during lookups based on particular feature values. In this way, it is easy to skip, for example, all tokens with particular part-of-speech tags, or with some previously computed semantic class.

Output is in the form of new annotations, and the type of the resulting annotations can be specified in a descriptor file. The mapping from dictionary entry attributes to the result annotation features can also be specified. Additionally, a string containing the matched text, a list of matched tokens, and the span enclosing the match can be specified to be set in the result annotations. It is also possible to indicate dictionary attributes to write back into each of the matched tokens.

Dictionary lookup is controlled by three parameters in the descriptor. One allows for order-independent lookup (i.e., A B == B A). Another toggles between finding only the longest match and finding all possible matches. The final parameter specifies the search strategy, of which there are three. The default search strategy only considers contiguous tokens (not including tokens from the stop word list or otherwise skipped tokens), and then begins the subsequent search after the longest match. The second strategy allows for ignoring non-matching tokens, allowing for disjoint matches, so that a dictionary entry of A C would match against the text A B C. As with the default search strategy, the subsequent search begins after the longest match. The final search strategy is identical to the previous, except that subsequent searches begin one token [...]
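
The difference between the first two search strategies described above can be illustrated with a small sketch. This is not ConceptMapper's code (ConceptMapper is a Java UIMA component); it is a simplified Python illustration, and the function names `contiguous_match` and `skip_any_match`, as well as the `skip` parameter, are hypothetical names chosen for this example. It assumes tokens are plain strings and that a match must begin at the given start position.

```python
from typing import FrozenSet, Optional, Sequence


def contiguous_match(tokens: Sequence[str], start: int, entry: Sequence[str],
                     skip: FrozenSet[str] = frozenset()) -> Optional[int]:
    """Default-style strategy (sketch): entry tokens must be adjacent in the
    text, except that tokens in `skip` (e.g. stop words) are passed over.
    Returns the index just past the match, or None if there is no match."""
    if not entry:
        return None
    i = start
    for wanted in entry:
        # Stop words and otherwise skipped tokens do not break contiguity.
        while i < len(tokens) and tokens[i] in skip:
            i += 1
        if i >= len(tokens) or tokens[i] != wanted:
            return None
        i += 1
    return i


def skip_any_match(tokens: Sequence[str], start: int,
                   entry: Sequence[str]) -> Optional[int]:
    """Second-style strategy (sketch): after the first entry token matches at
    `start`, any non-matching tokens may be ignored, so matches can be
    disjoint. Returns the index just past the match, or None."""
    if not entry or start >= len(tokens) or tokens[start] != entry[0]:
        return None
    i = start + 1
    for wanted in entry[1:]:
        # Ignore tokens until the next wanted entry token is found.
        while i < len(tokens) and tokens[i] != wanted:
            i += 1
        if i == len(tokens):
            return None
        i += 1
    return i


text = ["A", "B", "C"]
entry = ["A", "C"]
print(contiguous_match(text, 0, entry))  # no match: B separates A and C
print(skip_any_match(text, 0, entry))    # match: B is ignored
```

For the text A B C and the dictionary entry A C, the contiguous sketch finds no match, while the skip-any sketch matches and returns the position just past C, which is where a subsequent search would begin under the "after the longest match" rule.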
[jira] Commented: (UIMA-1033) ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component
[ https://issues.apache.org/jira/browse/UIMA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597983#action_12597983 ]

Marshall Schor commented on UIMA-1033:
--------------------------------------

Requested Software Grant - awaiting confirmation of receipt of same before loading into SVN.
Re: [jira] Commented: (UIMA-1033) ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component
Shouldn't we have a vote first? We shouldn't make Michael go through the trouble of obtaining a sw grant just to turn down the donation.

--Thilo

Marshall Schor (JIRA) wrote:
> Marshall Schor commented on UIMA-1033:
>
> Requested Software Grant - awaiting confirmation of receipt of same before loading into SVN.
[jira] Commented: (UIMA-1033) ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component
[ https://issues.apache.org/jira/browse/UIMA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597986#action_12597986 ]

Michael Tanenblatt commented on UIMA-1033:
------------------------------------------

Sorry about that missing uima.tt.TokenAnnotation--I must have done that in an overexuberant fit of cleaning!

As to future plans: we use ConceptMapper extensively in our projects, and am certainly interested in helping maintain and enhance it as needed, time permitting.
[jira] Reopened: (UIMA-1028) UIMA-AS Doc updates from proofreading
[ https://issues.apache.org/jira/browse/UIMA-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eddie Epstein reopened UIMA-1028:
--------------------------------

Cleanup of error handling description to accurately reflect current reality.

UIMA-AS Doc updates from proofreading
-------------------------------------

Key: UIMA-1028
URL: https://issues.apache.org/jira/browse/UIMA-1028
Project: UIMA
Issue Type: Improvement
Components: Async Scaleout, Documentation
Affects Versions: 2.2.2AS
Reporter: Marshall Schor
Assignee: Marshall Schor
Priority: Minor
Fix For: 2.2.2AS

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: [jira] Commented: (UIMA-1033) ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component
Thilo Goetz wrote:
> Shouldn't we have a vote first? We shouldn't make Michael go through the trouble of obtaining a sw grant just to turn down the donation.

OK.

-Marshall
[VOTE] Accept ConceptMapper into the Sandbox
Michael Tanenblatt has offered to donate ConceptMapper to Apache UIMA. There was some discussion on the uima-users list about this.

Thread: http://markmail.org/message/mfuubh5are7vs5ua

Please vote on whether or not to accept this donation (conditional on receiving a software grant for it, of course):

[ ] +1 Accept the donation of ConceptMapper into the Apache UIMA sandbox
[ ] 0 Don't care
[ ] -1 Don't accept this donation - because ...

-Marshall