Hi Peter,

>I'm inferring that there's no way to set the window size to N and have an
>exception list of a few items that are of length < N.
-- As far as I can recall there isn't any such method in the lookup.
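Since the lookup has no built-in exception list, one workaround in the spirit of Kean's brute-force suggestion further down the thread is to lower the window size and then post-filter against a small keep-list. The following is only a minimal sketch on plain strings: the class name, the `KEEP` set and the `WINDOW` constant are hypothetical, and in a real annotator the strings would be the covered text of each annotation (for example from `getCoveredText()`).

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class ShortTermExceptions {

   // Hypothetical exception list: short terms to keep even though they are
   // shorter than the intended window size.  The match is case-sensitive.
   static final Set<String> KEEP = Set.of( "BP", "PT" );

   // Assumed intended minimum term length (the window size from the thread).
   static final int WINDOW = 3;

   // True for text shorter than WINDOW that is not on the exception list.
   static final Predicate<String> IS_UNWANTED_SHORT
         = text -> text.length() < WINDOW && !KEEP.contains( text );

   private ShortTermExceptions() {}

   // Returns the texts that survive the filter, in their original order.
   static List<String> keepWanted( final List<String> coveredTexts ) {
      return coveredTexts.stream()
                         .filter( IS_UNWANTED_SHORT.negate() )
                         .collect( Collectors.toList() );
   }

   public static void main( final String[] args ) {
      // "pt" and "of" are dropped; "BP", "aspirin" and "PT" are kept.
      System.out.println(
            keepWanted( List.of( "BP", "pt", "of", "aspirin", "PT" ) ) );
   }
}
```

Because the keep-list comparison is case-sensitive, this also gives a rough version of the all-caps distinction at the filtering stage, at the cost of looking up every two-character term first.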
> Join all the 2&3 character gene terms with the 10,000 most common english
> words
-- I have seen this done elsewhere, and can't remember if anybody tested
precision gained vs. recall lost.  It would be highly related to
note/specialty type.
-- What was your english dictionary source?  I suppose that there could be
some blacklisting in a dictionary creator.

>It reduced the number of items to remove by an order of magnitude. ~4000
>down to ~400
-- Very nice.

>performance is a big factor in our project.
-- Yup.  If only the dictionary lookup differentiated between all-caps words
and lower or mixed case ...

Thanks for sharing your ideas,
Sean
________________________________________
From: Peter Abramowitsch <pabramowit...@gmail.com>
Sent: Tuesday, August 25, 2020 11:56 AM
To: dev@ctakes.apache.org
Subject: Re: Question about window size in term lookup [EXTERNAL]

Thanks Sean.  A lot of good ideas.  I hadn't even been thinking of
post-filtering, but that's a very viable approach.  Something like using
tweezers to remove a splinter instead of removing them from all the pieces
of wood you might encounter.  I like how you use the functor approach on the
filters.

Yesterday I tried another method too.  Join all the 2&3 character gene terms
with the 10,000 most common english words - then take the resulting list and
use it to create a deletion list in the dictionary creation step.  It
reduced the number of items to remove by an order of magnitude.  ~4000 down
to ~400

Deleting it in the dictionary is more painful up front, but more performant
than post filtering, for two obvious reasons, but using your approach and
checking if the # of gene references is > 0, one can choose to filter only
specific notes and that would increase performance again.  Unfortunately
performance is a big factor in our project.

From your response and Kean's I'm inferring that there's no way to set the
window size to N and have an exception list of a few items that are of
length < N.
Right?  If there were, it would be in the chunker, not the term lookup.

Thanks again for your suggestions!

Peter

On Tue, Aug 25, 2020 at 5:50 AM Finan, Sean
<sean.fi...@childrens.harvard.edu> wrote:

> I think that Kean is correct.  I usually create an annotator that removes
> terms that I don't want.  It is usually fairly easy.
>
> final Predicate<IdentifiedAnnotation> is2char
>       = a -> a.getCoveredText().length() == 2;
>
> final String geneTui = SemanticTui.getTui( "Gene or Genome" ).name();
>
> OntologyConceptUtil.getAnnotationsByTui( jCas, geneTui )
>       .stream()
>       .filter( is2char )
>       .forEach( Annotation::removeFromIndexes );
>
>
> Or, if you want to grab a few that aren't specifically "Gene" but are in
> the same semantic group (without looking it up in class SemanticGroup), and
> in the HGNC vocabulary :
>
> final Class<? extends IdentifiedAnnotation> geneClass
>       = SemanticTui.getTui( "Gene or Genome" )
>                    .getGroup()
>                    .getCtakesClass();
>
> final Predicate<IdentifiedAnnotation> isHgnc
>       = a -> OntologyConceptUtil.getSchemeCodes( a ).containsKey( "hgnc" );
>
> JCasUtil.select( jCas, geneClass )
>       .stream()
>       .filter( is2char )
>       .filter( isHgnc )
>       .forEach( Annotation::removeFromIndexes );
>
>
> "hgnc" may need to be "HGNC" ... and will only exist if you stored the
> HGNC codes in your dictionary.
>
>
> Or you can do it focusing on what you do want.
>
> final Collection<SemanticGroup> WANTED_GROUP
>       = EnumSet.of( SemanticGroup.DRUG, SemanticGroup.LAB );
>
> final Predicate<IdentifiedAnnotation> isTrashGroup
>       = a -> SemanticGroup.getGroups( a )
>                           .stream()
>                           .noneMatch( WANTED_GROUP::contains );
>
> JCasUtil.select( jCas, IdentifiedAnnotation.class )
>       .stream()
>       .filter( is2char )
>       .filter( isTrashGroup )
>       .forEach( Annotation::removeFromIndexes );
>
> Or if you want to cover all combinations that aren't all uppercase:
>
> final Predicate<IdentifiedAnnotation> notCaps
>       = a -> a.getCoveredText()
>              .chars()
>              .anyMatch( Character::isLowerCase );
>
> JCasUtil.select( jCas, IdentifiedAnnotation.class )
>       .stream()
>       .filter( is2char )
>       .filter( notCaps )
>       .forEach( Annotation::removeFromIndexes );
>
> Or mix and modify.  For instance, ignore character length but Tui = Gene
> and the text is not all caps.
>
> Sometimes I enjoy mocking up code ...
>
> Sean
>
> ________________________________________
> From: Kean Kaufmann <k...@recordsone.com>
> Sent: Monday, August 24, 2020 9:35 PM
> To: dev@ctakes.apache.org
> Subject: Re: Question about window size in term lookup [EXTERNAL]
>
> > my question is whether there's a place where one can register specific
> > two character terms, for example BP or PT which will be found even with
> > a window size set to three.
>
> My brute-force approach is pretty brutal: Change the window size to two,
> annotate terms, then remove all two-letter annotations except the very few
> I'm interested in.
>
> On Mon, Aug 24, 2020 at 9:07 PM Peter Abramowitsch
> <pabramowit...@gmail.com> wrote:
>
> > Hello all
> >
> > Is there a mechanism, a lookup file, etc which overrides the window size
> > set on the term annotator or the chunker.  Changing the window size from
> > the default of 3 to 2 opens the floodgate to false acronym annotations.
> > So my question is whether there's a place where one can register
> > specific two character terms, for example BP or PT which will be found
> > even with a window size set to three.
> >
> > A similar question about Genes.  On adding the HGNC vocabulary I notice
> > that there are many thousands of aliases for genes which overlap other
> > common acronyms and english words such as trip, spring, plan, bed, yes,
> > rip, prn etc.  I'm not sure if these aliases are ever used.  So I
> > created a sed script with 4000 regex expressions to remove the 2 and 3
> > letter gene synonyms from a script file.  I will only suppress the 4
> > letter synonyms manually where they cause trouble.  But does anyone have
> > a more elegant solution?
> >
> > Peter
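
The join described earlier in the thread (2 and 3 character gene terms against a list of common English words, producing a deletion list for the dictionary creation step) can be sketched roughly as below. This is an illustration only, not the script that was actually used: the class and method names are hypothetical, the inputs are assumed to be plain in-memory word lists, and matching is assumed to be case-insensitive.

```java
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class GeneDeletionList {

   private GeneDeletionList() {}

   // Hypothetical helper: returns the gene terms of length 2 or 3 that also
   // occur in the list of common English words, lower-cased and sorted.
   static Set<String> build( final Collection<String> geneTerms,
                             final Collection<String> commonWords ) {
      // Normalize the common words once so that each lookup is a set probe.
      final Set<String> common = commonWords.stream()
            .map( w -> w.trim().toLowerCase( Locale.ROOT ) )
            .collect( Collectors.toSet() );
      return geneTerms.stream()
            .map( String::trim )
            .filter( t -> t.length() == 2 || t.length() == 3 )
            .map( t -> t.toLowerCase( Locale.ROOT ) )
            .filter( common::contains )
            .collect( Collectors.toCollection( TreeSet::new ) );
   }

   public static void main( final String[] args ) {
      // Toy inputs; real inputs would be the HGNC synonyms and a word list.
      final List<String> genes = List.of( "BED", "RIP", "YES", "TRIP", "BRCA1" );
      final List<String> words = List.of( "bed", "rip", "yes", "trip", "plan" );
      // "TRIP" is skipped because it has four characters.
      System.out.println( build( genes, words ) );   // prints [bed, rip, yes]
   }
}
```

The resulting set is the short list of terms to delete when the dictionary is built, which keeps the cost out of the per-note pipeline; longer synonyms such as the 4 letter ones would still need the manual handling mentioned above.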