>>> -- What was your English dictionary source?  I suppose that there could
be some blacklisting in a dictionary creator.

I found these two

https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-usa.txt
https://www.mit.edu/~ecprice/wordlist.10000

But these were probably derived from internet usage.   I was surprised by
some of the words that showed up.

P.

On Tue, Aug 25, 2020 at 9:09 AM Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

> Hi Peter,
>
> >I'm inferring that there's no way to set the
> window size to N and have an exception list of a few items that are of
> length < N.
> -- As far as I can recall there isn't any such method in the lookup.
>
> > Join all the 2&3 character gene
> terms with the 10,000 most common English words
> -- I have seen this done elsewhere, and can't remember if anybody tested
> precision gained vs. recall lost.  It would be highly related to
> note/specialty type.
> -- What was your English dictionary source?  I suppose that there could be
> some blacklisting in a dictionary creator.
>
> >It reduced the number of items to remove by an order of magnitude.   ~4000
> down to ~400
> -- Very nice.
>
> >performance is a big factor in our project.
> -- Yup.
>
>
> If only the dictionary lookup differentiated between all-caps words and
> lower or mixed case ...
>
> Thanks for sharing your ideas,
> Sean
>
>
>
> ________________________________________
> From: Peter Abramowitsch <pabramowit...@gmail.com>
> Sent: Tuesday, August 25, 2020 11:56 AM
> To: dev@ctakes.apache.org
> Subject: Re: Question about window size in term lookup [EXTERNAL]
>
> * External Email - Caution *
>
>
> Thanks Sean.  A lot of good ideas.  I hadn't even been thinking of
> post-filtering, but that's a very viable approach.  Something like using
> tweezers to remove a splinter instead of removing splinters from all the
> pieces of wood you might encounter.   I like how you use the functor
> approach on the filters.
>
> Yesterday I tried another method too.   Join all the 2&3 character gene
> terms with the 10,000 most common English words, then take the resulting
> list and use it to create a deletion list in the dictionary creation step.
> It reduced the number of items to remove by an order of magnitude: ~4000
> down to ~400.
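For illustration, that intersection step might look something like this in plain Java. This is a minimal sketch with made-up inputs; `DeletionListSketch` and its method are hypothetical names, not cTAKES API:

```java
import java.util.*;
import java.util.stream.*;

// Minimal sketch: gene synonyms of length 2 or 3 that collide with common
// English words become the deletion list for the dictionary creation step.
class DeletionListSketch {
    static Set<String> deletionList( Collection<String> geneSynonyms,
                                     Collection<String> commonWords ) {
        final Set<String> common = commonWords.stream()
                                              .map( String::toLowerCase )
                                              .collect( Collectors.toSet() );
        return geneSynonyms.stream()
                           .filter( s -> s.length() == 2 || s.length() == 3 )
                           .filter( s -> common.contains( s.toLowerCase() ) )
                           .collect( Collectors.toCollection( TreeSet::new ) );
    }
}
```

Feeding in the full HGNC synonym list and one of the 10,000-word lists above would produce the ~400-entry deletion list described.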
>
> Deleting terms in the dictionary is more painful up front, but more
> performant than post-filtering, for two obvious reasons.  But using your
> approach and checking if the # of gene references is > 0, one can choose
> to filter only specific notes, and that would increase performance again.
> Unfortunately performance is a big factor in our project.
>
> From your response and Kean's I'm inferring that there's no way to set the
> window size to N and have an exception list of a few items that are of
> length < N.  Right?  If there were, it would be in the chunker, not the
> term lookup.
>
> Thanks again for your suggestions!
>
> Peter
>
> On Tue, Aug 25, 2020 at 5:50 AM Finan, Sean <
> sean.fi...@childrens.harvard.edu> wrote:
>
> > I think that Kean is correct.  I usually create an annotator that removes
> > terms that I don't want.  It is fairly easy.
> >
> >       final Predicate<IdentifiedAnnotation> is2char
> >             = a -> a.getCoveredText().length() == 2;
> >
> >       final String geneTui = SemanticTui.getTui( "Gene or Genome" ).name();
> >
> >       OntologyConceptUtil.getAnnotationsByTui( jCas, geneTui )
> >                          .stream()
> >                          .filter( is2char )
> >                          .forEach( Annotation::removeFromIndexes );
> >
> >
> > Or, if you want to grab a few that aren't specifically "Gene" but are in
> > the same semantic group (without looking it up in class SemanticGroup),
> > and in the HGNC vocabulary:
> >
> >       final Class<? extends IdentifiedAnnotation> geneClass
> >             = SemanticTui.getTui( "Gene or Genome" )
> >                          .getGroup()
> >                          .getCtakesClass();
> >
> >       final Predicate<IdentifiedAnnotation> isHgnc
> >             = a -> OntologyConceptUtil.getSchemeCodes( a ).containsKey( "hgnc" );
> >
> >       JCasUtil.select( jCas, geneClass )
> >               .stream()
> >               .filter( is2char )
> >               .filter( isHgnc )
> >               .forEach( Annotation::removeFromIndexes );
> >
> >
> > "hgnc" may need to be "HGNC" ... and will only exist if you stored the
> > HGNC codes in your dictionary.
> >
> >
> > Or you can do it focusing on what you do want.
> >
> >       final Collection<SemanticGroup> WANTED_GROUP
> >             = EnumSet.of( SemanticGroup.DRUG, SemanticGroup.LAB );
> >
> >       final Predicate<IdentifiedAnnotation> isTrashGroup
> >             = a -> SemanticGroup.getGroups( a )
> >                                 .stream()
> >                                 .noneMatch( WANTED_GROUP::contains );
> >
> >       JCasUtil.select( jCas, IdentifiedAnnotation.class )
> >               .stream()
> >               .filter( is2char )
> >               .filter( isTrashGroup )
> >               .forEach( Annotation::removeFromIndexes );
> >
> > Or if you want to cover all combinations that aren't all uppercase:
> >
> >       final Predicate<IdentifiedAnnotation> notCaps
> >             = a -> a.getCoveredText()
> >                     .chars()
> >                     .anyMatch( Character::isLowerCase );
> >
> >       JCasUtil.select( jCas, IdentifiedAnnotation.class )
> >               .stream()
> >               .filter( is2char )
> >               .filter( notCaps )
> >               .forEach( Annotation::removeFromIndexes );
> >
> > Or mix and modify.  For instance, ignore character length but require
> > Tui = Gene and text that is not all caps.
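The mix-and-modify idea is just `Predicate` composition with `Predicate.and()`. A self-contained sketch over plain strings standing in for annotation covered text (no cTAKES classes; all names here are hypothetical):

```java
import java.util.*;
import java.util.function.Predicate;
import java.util.stream.*;

// Sketch of "mix and modify": individual filters are composed with
// Predicate.and(), here over plain strings rather than IdentifiedAnnotations.
class MixAndModify {
    static List<String> shortNotCaps( Collection<String> texts ) {
        final Predicate<String> isShort = t -> t.length() <= 3;
        final Predicate<String> notCaps
                = t -> t.chars().anyMatch( Character::isLowerCase );
        return texts.stream()
                    .filter( isShort.and( notCaps ) )
                    .collect( Collectors.toList() );
    }
}
```

Swapping the string predicates for the `IdentifiedAnnotation` predicates shown above gives the same composition inside an annotator.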
> >
> > Sometimes I enjoy mocking up code ...
> >
> > Sean
> >
> > ________________________________________
> > From: Kean Kaufmann <k...@recordsone.com>
> > Sent: Monday, August 24, 2020 9:35 PM
> > To: dev@ctakes.apache.org
> > Subject: Re: Question about window size in term lookup [EXTERNAL]
> >
> >
> > >
> > > my question is whether there's a place where one can register specific
> > > two character terms, for example BP or PT which will be found even with
> > > a window size set to three.
> >
> >
> > My brute-force approach is pretty brutal: Change the window size to two,
> > annotate terms, then remove all two-letter annotations except the very
> > few I'm interested in.
> >
> > On Mon, Aug 24, 2020 at 9:07 PM Peter Abramowitsch <
> > pabramowit...@gmail.com>
> > wrote:
> >
> > > Hello all
> > >
> > > Is there a mechanism, a lookup file, etc. which overrides the window
> > > size set on the term annotator or the chunker?  Changing the window
> > > size from the default of 3 to 2 opens the floodgates to false acronym
> > > annotations.  So my question is whether there's a place where one can
> > > register specific two character terms, for example BP or PT, which
> > > will be found even with a window size set to three.
> > >
> > > A similar question about genes.  On adding the HGNC vocabulary I
> > > noticed that there are many thousands of aliases for genes which
> > > overlap other common acronyms and English words, such as trip, spring,
> > > plan, bed, yes, rip, prn, etc.  I'm not sure these aliases are ever
> > > used, so I created a sed script with 4000 regex expressions to remove
> > > the 2 and 3 letter gene synonyms from a script file.  I will only
> > > suppress the 4 letter synonyms manually where they cause trouble.  But
> > > does anyone have a more elegant solution?
> > >
> > > Peter
> > >
> >
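One sed-free alternative to the 4000-regex script above would be a single pass over the term file in Java. This is only a sketch: the bar-delimited line layout is an assumption, not the real cTAKES dictionary format, and `TermFileFilter` is a hypothetical name:

```java
import java.util.*;
import java.util.stream.*;

// One pass over a term file instead of thousands of sed regexes: drop any
// line whose first (term) column is in the removal set.  The bar-delimited
// layout here is an assumption, not the real dictionary format.
class TermFileFilter {
    static List<String> filterLines( List<String> lines, Set<String> remove ) {
        return lines.stream()
                    .filter( l -> !remove.contains(
                            l.split( "\\|" )[0].toLowerCase() ) )
                    .collect( Collectors.toList() );
    }
}
```

The removal set would be the 2- and 3-letter gene synonyms that collide with common words; the surviving lines go back into dictionary creation.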
>
