Hi David,

You could look into the ROUGE package, which does n-gram-based matching:
https://huggingface.co/spaces/evaluate-metric/rouge

You will find the actual n-gram computation code here:
https://github.com/google-research/google-research/blob/master/rouge/rouge_scorer.py

Also, fuzzy string matching is another efficient approach to substring
matching: https://github.com/seatgeek/thefuzz
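
To make the idea concrete, here is a minimal sketch of how the two steps could fit together: collapse near-identical tweets first, then count n-grams on what remains. It is only an illustration under some assumptions, not code from either package above. It uses Python's standard-library difflib as a stand-in for thefuzz, the helper names (ngrams, dedupe_similar, count_ngrams), the 0.9 similarity threshold and the sample tweets are all made up for the example, and the pairwise comparison is quadratic, so it would need rethinking for a large corpus.

```python
from collections import Counter
from difflib import SequenceMatcher


def ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def dedupe_similar(tweets, threshold=0.9):
    """Keep one representative per group of near-identical tweets.

    The threshold is a hypothetical value chosen for this example;
    difflib's ratio() runs from 0.0 (no overlap) to 1.0 (identical).
    """
    kept = []
    for tweet in tweets:
        is_duplicate = any(
            SequenceMatcher(None, tweet, k, autojunk=False).ratio() >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(tweet)
    return kept


def count_ngrams(tweets, n):
    """Count word n-grams over lower-cased, whitespace-split tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(ngrams(tweet.lower().split(), n))
    return counts


# Invented sample data: three copies of one tweet (one with a trailing "!")
# plus one unrelated tweet.
tweets = [
    "Apply now for our autumn open day",
    "Apply now for our autumn open day",
    "Apply now for our autumn open day!",
    "Our library is open late this week",
]

unique = dedupe_similar(tweets)
raw = count_ngrams(tweets, 3)
deduped = count_ngrams(unique, 3)

print(len(tweets), "tweets,", len(unique), "after collapsing near-duplicates")
print("raw count:    ", raw[("apply", "now", "for")])
print("deduped count:", deduped[("apply", "now", "for")])
```

Comparing the raw and deduplicated counts shows how much of an n-gram's frequency comes from repeated tweets rather than from independent use of the phrase.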

Thanks,
Mousumi


On Fri, Jun 23, 2023 at 9:39 AM David Beauchamp via Corpora <
corpora@list.elra.info> wrote:

> Hi all, I'm doing analysis on a corpus of tweets from institutions.
> Regarding analysis of n-grams, it is quite unusual in that there are many
> repeated exact tweets, or very similar tweets, leading to long super
> strings of often 9 or 10 or more words together.  Naturally this makes
> accurate counting and classifying difficult due to the overlapping
> substrings.  Does anyone know of any approaches or software which can count
> and classify n-grams in such circumstances?  I am aware of approaches
> outlined by Buerki (2017) and O'Donnell (2011), but these do not seem
> practical due to the excessive length of the n-grams in the corpus.  Does
> anyone know of any accessible methods or packages?
>
> Any input much appreciated.
> _______________________________________________
> Corpora mailing list -- corpora@list.elra.info
> https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
> To unsubscribe send an email to corpora-le...@list.elra.info
>