Sounds like a classic use case for the tf–idf measure, which down-weights terms that occur in most documents of a collection. For those with no background in information retrieval, see https://en.wikipedia.org/wiki/Tf%E2%80%93idf
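
For anyone who wants to see the effect in practice, here is a minimal sketch in plain Python. It assumes the simplest textbook weighting, tf * log(N / df); the three keyword lists are made up for illustration and are not taken from the carrels, and the function name tfidf is just a placeholder.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical keyword lists standing in for three articles.
docs = [
    ["covid-19", "patient", "cell", "ventilator"],
    ["covid-19", "patient", "vaccine", "antibody"],
    ["covid-19", "cell", "protein", "spike"],
]

for weights in tfidf(docs):
    # Print each document's terms from most to least distinctive.
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    print([(term, round(w, 3)) for term, w in ranked])
```

With this weighting a term that appears in every document, like "covid-19" here, scores exactly zero, and terms shared by most documents score low, so they sink to the bottom of each ranking without anyone maintaining a stopword list by hand. Real toolkits usually apply smoothed variants of the formula, so the exact numbers will differ.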

cheers
stuart
--
...let us be heard from red core to black sky

On Sat, 11 Jul 2020 at 06:58, Eric Lease Morgan <emor...@nd.edu> wrote:
>
> To stop word, or not to stop word? That is the question.
>
> Seriously, I am working with a team of people to index and analyze a set of
> 65,000 - 100,000 full text scientific journal articles, and all of the
> articles are on the topic of COVID-19. [1] We have indexed the data set and
> we have created subsets of the data, affectionately called "study carrels".
> Each study carrel is characterized with a short name and a few
> bibliographic-like features. [2] Within each study carrel are a number of
> different analyses, such as ngram frequencies, parts-of-speech enumerations,
> and topic modeling.
>
> Each article in each carrel also has a set of "keywords" extracted from it.
> These keywords are computed, and for all intents & purposes, the computation
> is pretty good. For example, see a set of keywords from a particular carrel.
> [3] Unfortunately, many of the study carrels have very very very similar sets
> of keywords. Again, if you peruse the set of all the carrels [2] you see the
> preponderance of keywords such as "cell", "covid-19", "SARS", and "patient".
> These words happen so frequently that they become (almost) meaningless.
>
> My questions to y'all are, "When and where should I add something like
> 'cell', or better yet 'covid-19', to my list of stopwords?"
>
>
> [1] data set of articles - https://www.semanticscholar.org/cord19
> [2] study carrels - https://cord.distantreader.org/carrels/INDEX.HTM
> [3] example keywords -
> https://cord.distantreader.org/carrels/kaggle-risk-factors/index.htm#keywords
>
> --
> Eric Morgan