On Oct 22, 2020, at 2:25 PM, Edward M. Corrado <ecorr...@ecorrado.us> wrote:
> I have a set of just over 60,000 theses and dissertation abstracts from which I
> want to automatically create keywords/topics. Does anyone have any
> recommendations for text mining or other tools to start with?

I do this sort of thing on a regular basis, and I use two Python libraries/modules:

  1. textacy.ke.scake
  2. textacy.ke.yake

Textacy is built on top of another library called "spaCy". To use the libraries one:

  1. gets a string
  2. creates a spaCy doc object from the string
  3. applies the scake or yake method to the object
  4. gets back a keyword (or phrase) plus a score

Attached is a script which takes a file as input and outputs a tab-delimited stream of keywords/phrases.

--
Eric Morgan
#!/usr/bin/env python

# txt2keywords.py - given a file, output a tab-delimited list of keywords

# configure
TOPN  = 0.005
MODEL = 'en_core_web_sm'

# require
import textacy.preprocessing
from textacy.ke.scake import scake
from textacy.ke.yake import yake
import spacy
import os
import sys

# sanity check
if len( sys.argv ) != 2 :
	sys.stderr.write( 'Usage: ' + sys.argv[ 0 ] + " <file>\n" )
	quit()

# initialize
file = sys.argv[ 1 ]

# open the given file and unwrap it
with open( file ) as f : text = f.read()
text = textacy.preprocessing.normalize.normalize_quotation_marks( text )
text = textacy.preprocessing.normalize.normalize_hyphenated_words( text )
text = textacy.preprocessing.normalize.normalize_whitespace( text )

# compute the identifier; the given file's name, sans extension
id = os.path.basename( os.path.splitext( file )[ 0 ] )

# initialize the model; raise spaCy's length limit to cover the whole text
maximum = len( text ) + 1
model   = spacy.load( MODEL, max_length=maximum )
doc     = model( text )

# output a header
print( "id\tkeyword" )

# track found keywords to avoid duplicates
keywords = set()

# process and output each keyword with yake; will produce unigrams
for keyword, score in yake( doc, topn=TOPN ) :
	if keyword not in keywords :
		print( "\t".join( [ id, keyword ] ) )
		keywords.add( keyword )

# process and output each keyphrase with scake; removing
# lemmatization with normalize=None seems to produce better results
for keyword, score in scake( doc, normalize=None, topn=TOPN ) :
	if keyword not in keywords :
		print( "\t".join( [ id, keyword ] ) )
		keywords.add( keyword )

# done
exit()
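Once the script has been run against each of the 60,000-odd abstracts and the results concatenated, the per-document rows can be rolled up into corpus-wide topics. Below is a minimal, standard-library-only sketch of that roll-up; the inline rows stand in for a concatenated results file (any such file name, e.g. keywords.tsv, is hypothetical), and the repeated header lines that concatenation leaves behind are skipped:

```python
import csv
from collections import Counter
from io import StringIO

# simulated concatenation of the script's output for three documents;
# in practice this would be read from a file on disk
rows = StringIO( '\n'.join( [
	'id\tkeyword',
	'thesis-0001\tmachine learning',
	'thesis-0001\tneural networks',
	'id\tkeyword',
	'thesis-0002\tmachine learning',
	'id\tkeyword',
	'thesis-0003\tmachine learning',
	'thesis-0003\tneural networks',
] ) )

# count the number of documents in which each keyword appears;
# the script de-duplicates within a document, so each row counts once
counts = Counter()
for ( id, keyword ) in csv.reader( rows, delimiter='\t' ) :
	if id == 'id' : continue  # skip repeated headers from concatenation
	counts[ keyword ] += 1

# list the most common keywords across the (tiny) corpus
for keyword, count in counts.most_common( 2 ) :
	print( keyword, count )
```

The document frequencies that fall out of the Counter are a simple, serviceable picture of the corpus's topics, and they can also feed fancier downstream work such as topic modeling.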