>>> -- What was your english dictionary source? I suppose that there could be some blacklisting in a dictionary creator.
I found these two
https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-usa.txt
https://www.mit.edu/~ecprice/wordlist.10000

But these were probably derived from internet usage. I was surprised by some of the words that showed up

P.

On Tue, Aug 25, 2020 at 9:09 AM Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:

> Hi Peter,
>
> >I'm inferring that there's no way to set the window size to N and have
> >an exception list of a few items that are of length < N.
> -- As far as I can recall there isn't any such method in the lookup.
>
> > Join all the 2&3 character gene terms with the 10,000 most common
> > english words
> -- I have seen this done elsewhere, and can't remember if anybody tested
> precision gained vs. recall lost. It would be highly related to
> note/specialty type.
> -- What was your english dictionary source? I suppose that there could be
> some blacklisting in a dictionary creator.
>
> >It reduced the number of items to remove by an order of magnitude. ~4000
> >down to ~400
> -- Very nice.
>
> >performance is a big factor in our project.
> -- Yup.
>
> If only the dictionary lookup differentiated between all-caps words and
> lower or mixed case ...
>
> Thanks for sharing your ideas,
> Sean
>
> ________________________________________
> From: Peter Abramowitsch <pabramowit...@gmail.com>
> Sent: Tuesday, August 25, 2020 11:56 AM
> To: dev@ctakes.apache.org
> Subject: Re: Question about window size in term lookup [EXTERNAL]
>
> * External Email - Caution *
>
> Thanks Sean. A lot of good ideas. I hadn't even been thinking of
> post-filtering, but that's a very viable approach. Something like using
> tweezers to remove a splinter instead of removing them from all the pieces
> of wood you might encounter. I like how you use the functor approach on
> the filters.
>
> Yesterday I tried another method too.
> Join all the 2&3 character gene terms with the 10,000 most common english
> words - then take the resulting list and use it to create a deletion list
> in the dictionary creation step. It reduced the number of items to remove
> by an order of magnitude. ~4000 down to ~400
>
> Deleting it in the dictionary is more painful up front, but more performant
> than post filtering, for two obvious reasons, but using your approach and
> checking if the # of gene references is > 0, one can choose to filter only
> specific notes and that would increase performance again. Unfortunately
> performance is a big factor in our project.
>
> From your response and Kean's I'm inferring that there's no way to set the
> window size to N and have an exception list of a few items that are of
> length < N. Right? If there were, it would be in the chunker, not the
> term lookup.
>
> Thanks again for your suggestions!
>
> Peter
>
> On Tue, Aug 25, 2020 at 5:50 AM Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:
>
> > I think that Kean is correct. I usually create an annotator that removes
> > terms that I don't want. It is usually fairly easy.
> >
> >     final Predicate<IdentifiedAnnotation> is2char
> >           = a -> a.getCoveredText().length() == 2;
> >
> >     final String geneTui = SemanticTui.getTui( "Gene or Genome" ).name();
> >
> >     OntologyConceptUtil.getAnnotationsByTui( jCas, geneTui )
> >           .stream()
> >           .filter( is2char )
> >           .forEach( Annotation::removeFromIndexes );
> >
> > Or, if you want to grab a few that aren't specifically "Gene" but are in
> > the same semantic group (without looking it up in class SemanticGroup), and
> > in the HGNC vocabulary :
> >
> >     final Class<? extends IdentifiedAnnotation> geneClass
> >           = SemanticTui.getTui( "Gene or Genome" )
> >                        .getGroup()
> >                        .getCtakesClass();
> >
> >     final Predicate<IdentifiedAnnotation> isHgnc
> >           = a -> OntologyConceptUtil.getSchemeCodes( a ).containsKey( "hgnc" );
> >
> >     JCasUtil.select( jCas, geneClass )
> >           .stream()
> >           .filter( is2char )
> >           .filter( isHgnc )
> >           .forEach( Annotation::removeFromIndexes );
> >
> > "hgnc" may need to be "HGNC" ... and will only exist if you stored the
> > HGNC codes in your dictionary.
> >
> > Or you can do it focusing on what you do want.
> >
> >     final Collection<SemanticGroup> WANTED_GROUP = EnumSet.of(
> >           SemanticGroup.DRUG, SemanticGroup.LAB );
> >
> >     final Predicate<IdentifiedAnnotation> isTrashGroup
> >           = a -> SemanticGroup.getGroups( a )
> >                               .stream()
> >                               .noneMatch( WANTED_GROUP::contains );
> >
> >     JCasUtil.select( jCas, IdentifiedAnnotation.class )
> >           .stream()
> >           .filter( is2char )
> >           .filter( isTrashGroup )
> >           .forEach( Annotation::removeFromIndexes );
> >
> > Or if you want to cover all combinations that aren't all uppercase:
> >
> >     final Predicate<IdentifiedAnnotation> notCaps
> >           = a -> a.getCoveredText()
> >                   .chars()
> >                   .anyMatch( Character::isLowerCase );
> >
> >     JCasUtil.select( jCas, IdentifiedAnnotation.class )
> >           .stream()
> >           .filter( is2char )
> >           .filter( notCaps )
> >           .forEach( Annotation::removeFromIndexes );
> >
> > Or mix and modify. For instance, ignore character length but Tui = Gene
> > and the text is not all caps.
> >
> > Sometimes I enjoy mocking up code ...
> >
> > Sean
> >
> > ________________________________________
> > From: Kean Kaufmann <k...@recordsone.com>
> > Sent: Monday, August 24, 2020 9:35 PM
> > To: dev@ctakes.apache.org
> > Subject: Re: Question about window size in term lookup [EXTERNAL]
> >
> > * External Email - Caution *
> >
> > > my question is whether there's a place where one can register specific
> > > two character terms, for example BP or PT which will be found even with
> > > a window size set to three.
> >
> > My brute-force approach is pretty brutal: Change the window size to two,
> > annotate terms, then remove all two-letter annotations except the very few
> > I'm interested in.
> >
> > On Mon, Aug 24, 2020 at 9:07 PM Peter Abramowitsch <pabramowit...@gmail.com> wrote:
> >
> > > Hello all
> > >
> > > Is there a mechanism, a lookup file, etc which overrides the window size
> > > set on the term annotator or the chunker. Changing the window size from
> > > the default of 3 to 2 opens the floodgate to false acronym annotations. So
> > > my question is whether there's a place where one can register specific two
> > > character terms, for example BP or PT which will be found even with a
> > > window size set to three.
> > >
> > > A similar question about Genes. On adding the HGNC vocabulary I notice
> > > that there are many thousands of aliases for genes which overlap other
> > > common acronyms and english words such as trip, spring, plan, bed, yes,
> > > rip, prn etc. I'm not sure if these aliases are ever used. So I created
> > > a sed script with 4000 regex expressions to remove the 2 and 3 letter gene
> > > synonyms from a script file. I will only suppress the 4 letter synonyms
> > > manually where they cause trouble. But does anyone have a more elegant
> > > solution?
> > >
> > > Peter
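
For anyone who wants to try the deletion-list idea from this thread (join the 2 and 3 character gene synonyms with a common-English word list, then delete only the overlap), here is a minimal sketch in plain Java. It is an illustration under assumptions, not cTAKES code: the class name GeneDeletionList and the method buildDeletionList are hypothetical, the matching is a simple case-insensitive comparison, and the tiny inline lists stand in for the HGNC synonyms and a 10,000-word list such as the two linked above.

```java
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical helper, not part of cTAKES or its dictionary creator.
final class GeneDeletionList {

    private GeneDeletionList() {
    }

    // Returns the 2 and 3 character gene synonyms that are also common
    // English words, lower-cased and sorted. The result is the list of
    // terms one might delete during the dictionary creation step.
    static Set<String> buildDeletionList(final Collection<String> geneSynonyms,
                                         final Collection<String> commonWords) {
        // Normalise the English words once so lookups are case-insensitive.
        final Set<String> english = commonWords.stream()
                .map(w -> w.trim().toLowerCase(Locale.ROOT))
                .collect(Collectors.toSet());
        return geneSynonyms.stream()
                .map(s -> s.trim().toLowerCase(Locale.ROOT))
                // Only the short synonyms discussed in the thread.
                .filter(s -> s.length() == 2 || s.length() == 3)
                // Keep a synonym only if it collides with an English word.
                .filter(english::contains)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(final String[] args) {
        // Toy data for illustration only.
        final List<String> genes = List.of("BED", "RIP", "YES", "TP53", "TRIP");
        final List<String> words = List.of("bed", "rip", "yes", "trip", "plan");
        System.out.println(buildDeletionList(genes, words)); // prints [bed, rip, yes]
    }
}
```

The 4 letter synonym TRIP is left alone here, matching the plan of suppressing longer synonyms manually only where they cause trouble. Short synonyms that do not collide with a common word stay in the dictionary, which is presumably why the list shrinks from ~4000 entries to ~400.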