[jira] Commented: (UIMA-1033) ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component

2008-05-19 Thread Thilo Goetz (JIRA)

[ https://issues.apache.org/jira/browse/UIMA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597982#action_12597982 ]

Thilo Goetz commented on UIMA-1033:
---

Source code looks mostly fine to me, but doesn't compile.  Source code for 
uima.tt.TokenAnnotation is missing.  Not a big deal, though.

So what are your plans for this code once it's at Apache?  Will you continue to 
support and develop this code?


 ConceptMapper--a highly configurable, token-based dictionary lookup UIMA 
 component
 --

 Key: UIMA-1033
 URL: https://issues.apache.org/jira/browse/UIMA-1033
 Project: UIMA
  Issue Type: New Feature
  Components: Sandbox
 Environment: Java 5
Reporter: Michael Tanenblatt
Priority: Minor
 Attachments: conceptMapper.zip, conceptMapper.zip.md5

   Original Estimate: 24h
  Remaining Estimate: 24h

 ConceptMapper is a token-based dictionary lookup UIMA component. It was
 designed specifically to allow any external tokenizer that is a UIMA
 component to be used to tokenize its dictionary. Using the same tokenizer
 on both the dictionary and for subsequent text processing prevents
 situations where a particular dictionary entry is not found, though it
 exists, because it was tokenized differently than the text being processed.
 ConceptMapper is highly configurable, in terms of:
  * the way dictionary entries are mapped to resultant annotations
  * the way input documents are processed
  * the availability of multiple lookup strategies
  * its various output options.
 Additionally, a set of post-processing filters is supplied, as well as an
 interface for easily creating new filters. This allows for overgenerating
 results during the lookup phase, if so desired, then reducing the result
 set according to particular rules.
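
To make the overgenerate-then-filter idea concrete, here is a minimal Java sketch. All names in it (FilterSketch, Match, MatchFilter, DROP_NESTED) are hypothetical and chosen for illustration; this is not ConceptMapper's actual filter interface, only one plausible shape for a rule that prunes a result list after lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical names throughout; not ConceptMapper's real filter API.
final class FilterSketch {
    /** A lookup result: token offsets [begin, end) and the canonical form it maps to. */
    static final class Match {
        final int begin;
        final int end;
        final String canonical;

        Match(int begin, int end, String canonical) {
            this.begin = begin;
            this.end = end;
            this.canonical = canonical;
        }

        int length() {
            return end - begin;
        }

        @Override
        public String toString() {
            return canonical + "[" + begin + "," + end + ")";
        }
    }

    /** A post-processing rule that reduces an overgenerated result list. */
    interface MatchFilter {
        List<Match> apply(List<Match> matches);
    }

    /** Example rule: drop every match that lies inside a strictly longer match. */
    static final MatchFilter DROP_NESTED = matches -> {
        List<Match> kept = new ArrayList<>();
        for (Match m : matches) {
            boolean nested = false;
            for (Match other : matches) {
                boolean covers = other.begin <= m.begin && m.end <= other.end;
                if (covers && other.length() > m.length()) {
                    nested = true;
                    break;
                }
            }
            if (!nested) {
                kept.add(m);
            }
        }
        return kept;
    };

    public static void main(String[] args) {
        List<Match> all = Arrays.asList(
            new Match(0, 2, "abdominal fibromatosis"),
            new Match(1, 2, "fibromatosis"),
            new Match(3, 4, "desmoid"));
        // The nested single-token match is removed; the other two survive.
        System.out.println(DROP_NESTED.apply(all));
    }
}
```

A lookup configured to return all possible matches could feed its output through one or more such rules in sequence.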
 More details:
 The structure of the dictionary itself is quite flexible. Entries can have
 any number of variants (synonyms), and arbitrary features can be associated
 with dictionary entries. Individual variants inherit features from the parent
 token (i.e., the canonical form), but can override them or add additional
 features. In the following sample dictionary entry, there are 5 variants of
 the canonical form, and as described earlier, each inherits the SemClass
 and POS attributes from the canonical form, with the exception of the
 variant mesenteric fibromatosis (c48.1), which overrides the value of the
 SemClass attribute (this is somewhat of a contrived example, just to make
 that point):
 <token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
    <variant base="abdominal fibromatosis" />
    <variant base="abdominal desmoid" />
    <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
    <variant base="mesenteric fibromatosis" />
    <variant base="retroperitoneal fibromatosis" />
 </token>
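
The inheritance rule described above (a variant starts from the canonical entry's attributes, then overrides or adds its own) can be sketched in a few lines of Java. The class and method names are hypothetical and the map-based representation is an assumption made for illustration; it says nothing about how ConceptMapper stores entries internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; illustrates the inheritance rule only.
final class VariantFeatures {
    /** Start from the canonical entry's attributes, then apply the variant's own. */
    static Map<String, String> effectiveFeatures(Map<String, String> canonical,
                                                 Map<String, String> variant) {
        Map<String, String> merged = new LinkedHashMap<>(canonical);
        merged.putAll(variant); // variant values win on key collisions
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> token = new LinkedHashMap<>();
        token.put("SemClass", "Diagnosis");
        token.put("POS", "NN");

        Map<String, String> plainVariant = new LinkedHashMap<>();
        Map<String, String> overridingVariant = new LinkedHashMap<>();
        overridingVariant.put("SemClass", "Diagnosis-Site");

        // prints {SemClass=Diagnosis, POS=NN}
        System.out.println(effectiveFeatures(token, plainVariant));
        // prints {SemClass=Diagnosis-Site, POS=NN}
        System.out.println(effectiveFeatures(token, overridingVariant));
    }
}
```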
 Input tokens are processed one span at a time, where both the token and
 span (usually a sentence) annotation type are configurable. Additionally,
 the particular feature of the token annotation to use for lookups can be
 specified; otherwise its covered text is used. Other input configuration
 settings are whether to use case-sensitive matching, an optional class name
 of a stemmer to apply to the tokens, and a list of stop words to ignore
 during lookup. One additional input control mechanism is the ability to
 skip tokens during lookups based on particular feature values. In this way,
 it is easy to skip, for example, all tokens with particular part of speech
 tags, or with some previously computed semantic class.
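
As a rough illustration of the input controls just described, the following Java sketch reduces the tokens of one span to the keys that would take part in lookup. The names (LookupTokens, Token, lookupKeys) are hypothetical, stemming is left out, and the stop word list is assumed to be stored in the same case-normalized form as the keys; none of this is taken from ConceptMapper's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of input-side filtering; not ConceptMapper's implementation.
final class LookupTokens {
    /** A token with its covered text and a part-of-speech tag. */
    static final class Token {
        final String text;
        final String pos;

        Token(String text, String pos) {
            this.text = text;
            this.pos = pos;
        }
    }

    /**
     * Returns the lookup keys for one span: optionally lower-cased, with stop
     * words removed and tokens carrying a skipped feature value (here: a POS
     * tag) left out. Assumes stopWords is already in normalized form.
     */
    static List<String> lookupKeys(List<Token> span, boolean caseSensitive,
                                   Set<String> stopWords, Set<String> skippedPos) {
        List<String> keys = new ArrayList<>();
        for (Token t : span) {
            String key = caseSensitive ? t.text : t.text.toLowerCase(Locale.ROOT);
            if (stopWords.contains(key) || skippedPos.contains(t.pos)) {
                continue; // ignored during lookup
            }
            keys.add(key);
        }
        return keys;
    }

    public static void main(String[] args) {
        List<Token> span = Arrays.asList(
            new Token("The", "DT"),
            new Token("Abdominal", "JJ"),
            new Token("desmoid", "NN"),
            new Token(",", "PUNCT"));
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        Set<String> skippedPos = new HashSet<>(Arrays.asList("PUNCT"));
        // prints [abdominal, desmoid]
        System.out.println(lookupKeys(span, false, stopWords, skippedPos));
    }
}
```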
 Output is in the form of new annotations, and the type of resulting
 annotations can be specified in a descriptor file. The mapping from
 dictionary entry attributes to the result annotation features can also be
 specified. Additionally, a string containing the matched text, a list of
 matched tokens, and the span enclosing the match can be specified to be set
 in the result annotations. It is also possible to indicate dictionary
 attributes to write back into each of the matched tokens.
 Dictionary lookup is controlled by three parameters in the descriptor. One
 allows for order-independent lookup (i.e., A B == B A); another
 toggles between finding only the longest match and finding all possible
 matches. The final parameter specifies the search strategy, of which there
 are three. The default search strategy only considers contiguous tokens
 (not including tokens from the stop word list or otherwise skipped tokens),
 and then begins the subsequent search after the longest match. The second
 strategy allows for ignoring non-matching tokens, allowing for disjoint
 matches, so that a dictionary entry of
 A C
 would match against the text
 A B C
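 As with the default search strategy, the subsequent search begins after the
 longest match. The final search strategy is identical to the previous,
 except that subsequent searches begin one token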
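
The difference between the first two strategies (contiguous tokens only versus ignoring non-matching tokens) can be sketched as follows. This is a simplified illustration with hypothetical names: it works on plain strings, assumes the token list is a single span with stop words and skipped tokens already removed, and does not cover order-independent lookup or the longest-match versus all-matches setting.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of two matching strategies; not ConceptMapper's code.
final class SearchStrategies {
    /**
     * Contiguous strategy: the entry's tokens must appear back to back,
     * beginning at 'start'. Returns the index just past the match, or -1.
     */
    static int matchContiguous(List<String> text, int start, List<String> entry) {
        if (start + entry.size() > text.size()) {
            return -1;
        }
        for (int i = 0; i < entry.size(); i++) {
            if (!text.get(start + i).equals(entry.get(i))) {
                return -1;
            }
        }
        return start + entry.size();
    }

    /**
     * Skipping strategy: the entry's tokens must appear in order beginning at
     * 'start', but non-matching tokens in between are ignored.
     * Returns the index just past the last matched token, or -1.
     */
    static int matchSkipping(List<String> text, int start, List<String> entry) {
        if (entry.isEmpty() || start >= text.size()
                || !text.get(start).equals(entry.get(0))) {
            return -1;
        }
        int next = 1;        // next entry token still to be found
        int pos = start + 1; // next text position to inspect
        while (next < entry.size() && pos < text.size()) {
            if (text.get(pos).equals(entry.get(next))) {
                next++;
            }
            pos++;
        }
        return next == entry.size() ? pos : -1;
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("A", "B", "C");
        List<String> entry = Arrays.asList("A", "C");
        System.out.println(matchContiguous(text, 0, entry)); // prints -1
        System.out.println(matchSkipping(text, 0, entry));   // prints 3
    }
}
```

In both functions the return value is where a subsequent search would resume if it begins after the match.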

[jira] Commented: (UIMA-1033) ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component

2008-05-19 Thread Marshall Schor (JIRA)

[ https://issues.apache.org/jira/browse/UIMA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597983#action_12597983 ]

Marshall Schor commented on UIMA-1033:
--

Requested Software Grant - awaiting confirmation of receipt of same before 
loading into SVN.


Re: [jira] Commented: (UIMA-1033) ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component

2008-05-19 Thread Thilo Goetz

Shouldn't we have a vote first?  We shouldn't make Michael
go through the trouble of obtaining a sw grant just to turn
down the donation.

--Thilo

Marshall Schor (JIRA) wrote:
[ https://issues.apache.org/jira/browse/UIMA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597983#action_12597983 ]


Marshall Schor commented on UIMA-1033:
--

Requested Software Grant - awaiting confirmation of receipt of same before 
loading into SVN.



[jira] Commented: (UIMA-1033) ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component

2008-05-19 Thread Michael Tanenblatt (JIRA)

[ https://issues.apache.org/jira/browse/UIMA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597986#action_12597986 ]

Michael Tanenblatt commented on UIMA-1033:
--

Sorry about that missing uima.tt.TokenAnnotation. I must have removed it in an 
overexuberant fit of cleaning!
As to future plans: we use ConceptMapper extensively in our projects, and I am 
certainly interested in helping maintain and enhance it as needed, time 
permitting.



[jira] Reopened: (UIMA-1028) UIMA-AS Doc updates from proofreading

2008-05-19 Thread Eddie Epstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eddie Epstein reopened UIMA-1028:
-


Cleanup of error handling description to accurately reflect current reality.

 UIMA-AS Doc updates from proofreading
 -

 Key: UIMA-1028
 URL: https://issues.apache.org/jira/browse/UIMA-1028
 Project: UIMA
  Issue Type: Improvement
  Components: Async Scaleout, Documentation
Affects Versions: 2.2.2AS
Reporter: Marshall Schor
Assignee: Marshall Schor
Priority: Minor
 Fix For: 2.2.2AS




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Commented: (UIMA-1033) ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component

2008-05-19 Thread Marshall Schor

Thilo Goetz wrote:

Shouldn't we have a vote first?  We shouldn't make Michael
go through the trouble of obtaining a sw grant just to turn
down the donation.

OK.  -Marshall


--Thilo

Marshall Schor (JIRA) wrote:
[ https://issues.apache.org/jira/browse/UIMA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597983#action_12597983 ]

Marshall Schor commented on UIMA-1033:
--

Requested Software Grant - awaiting confirmation of receipt of same 
before loading into SVN.



[VOTE] Accept ConceptMapper into the Sandbox

2008-05-19 Thread Marshall Schor
Michael Tanenblatt has offered to donate ConceptMapper to Apache UIMA.  
There was some discussion on the uima-users list about this. 
Thread: http://markmail.org/message/mfuubh5are7vs5ua


Please vote on whether or not to accept this donation (conditional on 
receiving a software grant for it, of course)


[ ] +1  Accept the donation of ConceptMapper into the Apache UIMA sandbox
[ ]  0  Don't care
[ ] -1  Don't accept this donation - because ...

-Marshall