Hi Peter,

>I'm inferring that there's no way to set the
>window size to N and have an exception list of a few items that are of
>length < N.
-- As far as I can recall there isn't any such method in the lookup.

> Join all the 2&3 character gene
> terms with the 10,000 most common English words
-- I have seen this done elsewhere, and can't remember if anybody tested 
precision gained vs. recall lost.  It would be highly related to note/specialty 
type.
-- What was your English dictionary source?  I suppose that there could be some 
blacklisting in a dictionary creator.

>It reduced the number of items to remove by an order of magnitude.   ~4000
>down to ~400
-- Very nice.

>performance is a big factor in our project.
-- Yup.


If only the dictionary lookup differentiated between all-caps words and lower 
or mixed case ...
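
Until the lookup does that, the distinction can at least be made after the fact. A minimal, self-contained sketch in plain Java, not cTAKES code: the class name CaseCheck and the method isAllCaps are made up for illustration, and it works on the covered text only.

```java
public final class CaseCheck {

   private CaseCheck() {}

   // True when the text contains at least one letter and none of its letters are lowercase.
   // Digits and punctuation are ignored, so "TP53" counts as all caps.
   public static boolean isAllCaps( final String text ) {
      return text.chars().anyMatch( Character::isLetter )
             && text.chars().noneMatch( Character::isLowerCase );
   }

   public static void main( final String... args ) {
      for ( String token : new String[] { "BED", "Bed", "bed", "TP53" } ) {
         System.out.println( token + " all caps: " + isAllCaps( token ) );
      }
   }
}
```

A check like this could back a post-filter that keeps a short gene alias only when it was written in capitals, which is roughly what a case-aware lookup would have given for free.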

Thanks for sharing your ideas,
Sean



________________________________________
From: Peter Abramowitsch <pabramowit...@gmail.com>
Sent: Tuesday, August 25, 2020 11:56 AM
To: dev@ctakes.apache.org
Subject: Re: Question about window size in term lookup [EXTERNAL]

* External Email - Caution *


Thanks Sean.  A lot of good ideas.  I hadn't even been thinking of
post-filtering, but that's a very viable approach. Something like using
tweezers to remove a splinter instead of removing them from all the pieces
of wood you might encounter.   I like how you use the functor approach on
the filters.

Yesterday I tried another method too.   Join all the 2&3 character gene
terms with the 10,000 most common English words - then take the resulting
list and use it to create a deletion list in the dictionary creation step.
It reduced the number of items to remove by an order of magnitude:   ~4000
down to ~400.
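
A minimal sketch of that join, assuming both lists are already loaded as plain strings. The class name GeneDeletionList, the method build, and the case-insensitive matching are assumptions for illustration; the actual file formats and the dictionary creator's deletion step are not shown.

```java
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

public final class GeneDeletionList {

   private GeneDeletionList() {}

   // Returns the 2 and 3 character gene synonyms that are also common English words.
   // Matching is case-insensitive; results are lowercased and sorted.
   public static Set<String> build( final Collection<String> geneSynonyms,
                                    final Collection<String> commonWords ) {
      final Set<String> common = commonWords.stream()
                                            .map( w -> w.trim().toLowerCase( Locale.ROOT ) )
                                            .collect( Collectors.toSet() );
      return geneSynonyms.stream()
                         .map( s -> s.trim().toLowerCase( Locale.ROOT ) )
                         .filter( s -> s.length() == 2 || s.length() == 3 )
                         .filter( common::contains )
                         .collect( Collectors.toCollection( TreeSet::new ) );
   }

   public static void main( final String... args ) {
      // Toy data only; real input would be the HGNC synonyms and a word-frequency list.
      final Set<String> deletions = build(
            List.of( "RIP", "BED", "YES", "TRIP", "BRCA1", "TP53" ),
            List.of( "rip", "bed", "yes", "trip", "plan" ) );
      deletions.forEach( System.out::println );
   }
}
```

With the toy data, "TRIP" is left alone because it has four characters, which matches the plan of handling 4 letter synonyms manually.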

Deleting it in the dictionary is more painful up front, but more performant
than post-filtering, for two obvious reasons.  But using your approach and
checking if the # of gene references is > 0, one can choose to filter only
specific notes, and that would increase performance again.  Unfortunately
performance is a big factor in our project.
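
Roughly what I mean by gating the filter on gene references, as a self-contained sketch. Term is a hypothetical stand-in for an annotation (it is not a cTAKES type), and GeneGatedFilter and clean are names invented for illustration.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public final class GeneGatedFilter {

   private GeneGatedFilter() {}

   // Hypothetical stand-in for an annotation: its covered text and whether it is a gene mention.
   public static final class Term {
      public final String text;
      public final boolean isGene;

      public Term( final String text, final boolean isGene ) {
         this.text = text;
         this.isGene = isGene;
      }
   }

   // Returns the same list untouched when the note has no gene mentions;
   // otherwise returns a new list without the gene terms that match 'unwanted'.
   public static List<Term> clean( final List<Term> noteTerms, final Predicate<Term> unwanted ) {
      final boolean hasGenes = noteTerms.stream().anyMatch( t -> t.isGene );
      if ( !hasGenes ) {
         return noteTerms;
      }
      return noteTerms.stream()
                      .filter( t -> !( t.isGene && unwanted.test( t ) ) )
                      .collect( Collectors.toList() );
   }
}
```

Notes without any gene mention skip the filtering pass entirely, which is where the saving would come from if most notes fall in that group.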

From your response and Kean's I'm inferring that there's no way to set the
window size to N and have an exception list of a few items that are of
length < N.  Right?  If there were, it would be in the chunker, not the
term lookup.
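
Lacking that, the exception list can be emulated after lookup, along the lines Kean described: run with the smaller window, then drop every short term that is not explicitly allowed. A minimal sketch on covered text only; ShortTermFilter, keep and filter are hypothetical names, and nothing here touches the cTAKES lookup or chunker configuration.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public final class ShortTermFilter {

   private ShortTermFilter() {}

   // Keeps a term when it has at least minLength characters,
   // or when it is listed in the exceptions (exact, case-sensitive match).
   public static Predicate<String> keep( final int minLength, final Set<String> exceptions ) {
      return text -> text.length() >= minLength || exceptions.contains( text );
   }

   public static List<String> filter( final List<String> coveredTexts,
                                      final int minLength,
                                      final Set<String> exceptions ) {
      return coveredTexts.stream()
                         .filter( keep( minLength, exceptions ) )
                         .collect( Collectors.toList() );
   }

   public static void main( final String... args ) {
      // Terms found with a window of 2, filtered as if N were 3 with BP and PT as exceptions.
      System.out.println( filter( List.of( "BP", "PT", "is", "on", "aspirin" ), 3, Set.of( "BP", "PT" ) ) );
   }
}
```

The exact-match exception set means "bp" in lowercase would still be dropped, which may or may not be what is wanted for a given note type.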

Thanks again for your suggestions!

Peter

On Tue, Aug 25, 2020 at 5:50 AM Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

> I think that Kean is correct.  I usually create an annotator that removes
> terms that I don't want.  It is usually fairly easy.
>
>       final Predicate<IdentifiedAnnotation> is2char
>             = a -> a.getCoveredText().length() == 2;
>
>       final String geneTui = SemanticTui.getTui( "Gene or Genome" ).name();
>
>       OntologyConceptUtil.getAnnotationsByTui( jCas, geneTui )
>                          .stream()
>                          .filter( is2char )
>                          .forEach( Annotation::removeFromIndexes );
>
>
> Or, if you want to grab a few that aren't specifically "Gene" but are in
> the same semantic group (without looking it up in class SemanticGroup), and
> in the HGNC vocabulary :
>
>       final Class<? extends IdentifiedAnnotation> geneClass
>             = SemanticTui.getTui( "Gene or Genome" )
>                          .getGroup()
>                          .getCtakesClass();
>
>       final Predicate<IdentifiedAnnotation> isHgnc
>             = a -> OntologyConceptUtil.getSchemeCodes( a ).containsKey( "hgnc" );
>
>       JCasUtil.select( jCas, geneClass )
>               .stream()
>               .filter( is2char )
>               .filter( isHgnc )
>               .forEach( Annotation::removeFromIndexes );
>
>
> "hgnc" may need to be "HGNC" ... and will only exist if you stored the
> HGNC codes in your dictionary.
>
>
> Or you can do it focusing on what you do want.
>
>       final Collection<SemanticGroup> WANTED_GROUP
>             = EnumSet.of( SemanticGroup.DRUG, SemanticGroup.LAB );
>
>       final Predicate<IdentifiedAnnotation> isTrashGroup
>             = a -> SemanticGroup.getGroups( a )
>                                 .stream()
>                                 .noneMatch( WANTED_GROUP::contains );
>
>       JCasUtil.select( jCas, IdentifiedAnnotation.class )
>               .stream()
>               .filter( is2char )
>               .filter( isTrashGroup )
>               .forEach( Annotation::removeFromIndexes );
>
> Or if you want to cover all combinations that aren't all uppercase:
>
>       final Predicate<IdentifiedAnnotation> notCaps
>             = a -> a.getCoveredText()
>                     .chars()
>                     .anyMatch( Character::isLowerCase );
>
>       JCasUtil.select( jCas, IdentifiedAnnotation.class )
>               .stream()
>               .filter( is2char )
>               .filter( notCaps )
>               .forEach( Annotation::removeFromIndexes );
>
> Or mix and modify.  For instance, ignore character length but require that
> the Tui is Gene and the text is not all caps.
>
> Sometimes I enjoy mocking up code ...
>
> Sean
>
> ________________________________________
> From: Kean Kaufmann <k...@recordsone.com>
> Sent: Monday, August 24, 2020 9:35 PM
> To: dev@ctakes.apache.org
> Subject: Re: Question about window size in term lookup [EXTERNAL]
>
>
> >
> > my question is whether there's a place where one can register specific two
> > character terms, for example BP or PT which will be found even with a
> > window size set to three.
>
>
> My brute-force approach is pretty brutal: Change the window size to two,
> annotate terms, then remove all two-letter annotations except the very few
> I'm interested in.
>
> On Mon, Aug 24, 2020 at 9:07 PM Peter Abramowitsch <pabramowit...@gmail.com>
> wrote:
>
> > Hello all
> >
> > Is there a mechanism, a lookup file, etc which overrides the window size
> > set on the term annotator or the chunker?   Changing the window size from
> > the default of 3 to 2 opens the floodgate to false acronym annotations.  So
> > my question is whether there's a place where one can register specific two
> > character terms, for example BP or PT which will be found even with a
> > window size set to three.
> >
> > A similar question about Genes.   On adding the HGNC vocabulary I notice
> > that there are many thousands of aliases for genes which overlap other
> > common acronyms and English words such as trip, spring, plan, bed, yes,
> > rip, prn etc.   I'm not sure if these aliases are ever used.   So I created
> > a sed script with 4000 regular expressions to remove the 2 and 3 letter gene
> > synonyms from a script file.  I will only suppress the 4 letter synonyms
> > manually where they cause trouble.     But does anyone have a more elegant
> > solution?
> >
> > Peter
> >
>
