Hi David,

You can look into the ROUGE package, which does n-gram-based matching:
https://huggingface.co/spaces/evaluate-metric/rouge
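
To give a rough idea of what n-gram-based matching means here, below is a minimal sketch of a ROUGE-N style overlap between two tweets. It assumes plain whitespace tokenization and lowercasing, and the function names (ngrams, ngram_overlap) are my own for illustration; they are not the ROUGE package's API.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(reference, candidate, n=2):
    """Return (precision, recall, f1) of clipped n-gram overlap.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram,
    # so a repeated n-gram is only credited as often as it occurs in both.
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


tweet_a = "Our offices are closed today due to the storm"
tweet_b = "Our offices are closed tomorrow due to the storm"
print(ngram_overlap(tweet_a, tweet_b, n=2))
```

A high score between two tweets suggests they are near-duplicates, which may help you group repeated tweets before counting n-grams.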
You will find the actual n-gram computation code here:
https://github.com/google-research/google-research/blob/master/rouge/rouge_scorer.py

Also, fuzzy string matching is another efficient approach for substring matching:
https://github.com/seatgeek/thefuzz

Thanks,
Mousumi

On Fri, Jun 23, 2023 at 9:39 AM David Beauchamp via Corpora <corpora@list.elra.info> wrote:

> Hi all, I'm doing analysis on a corpus on tweets from institutions.
> Regarding analysis of n-grams, it is quite unusual in that there are many
> repeated exact tweets, or very similar tweets, leading to long super
> strings of often 9 or 10 or more words together. Naturally this makes
> accurate counting and classifying difficult due to the overlapping
> substrings. Does anyone know of any approaches or software which can count
> and classify n-grams in such circumstances? I am aware of approaches
> outlined by Buerki (2017) and O'Donnell (2011), but these do not seem
> practical due to the excessive length of the n-grams in the corpus. Does
> anyone know of any accessible methods or packages?
>
> Any input much appreciated.
> _______________________________________________
> Corpora mailing list -- corpora@list.elra.info
> https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
> To unsubscribe send an email to corpora-le...@list.elra.info