Hi all, I'm analysing a corpus of tweets from institutions.  For n-gram 
analysis the corpus is quite unusual: it contains many exactly repeated or 
near-duplicate tweets, which produce long superstrings, often of 9 or 10 or 
more words.  Naturally, this makes accurate counting and classification of 
n-grams difficult, because every shorter n-gram contained in a superstring is 
counted again each time the superstring recurs.  Does anyone know of any 
approaches or software that can count and classify n-grams in such 
circumstances?  I am aware of the approaches outlined by Buerki (2017) and 
O'Donnell (2011), but these do not seem practical given the excessive length 
of the n-grams in the corpus.  Are there any accessible methods or packages 
that would cope with this?
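
As a concrete illustration of the problem, here is a minimal Python sketch. 
The tweets are invented toy data and the function names are hypothetical; it 
only shows how repeated tweets inflate the counts of their substrings, and it 
does not implement the methods of Buerki (2017) or O'Donnell (2011).

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(tweets, n):
    """Count n-grams over an iterable of tweets (naive whitespace tokens)."""
    counts = Counter()
    for tweet in tweets:
        counts.update(ngrams(tweet.lower().split(), n))
    return counts


# Toy data (invented): one tweet posted three times, plus one other tweet.
repeated = "our offices will be closed on monday for the public holiday"
tweets = [repeated] * 3 + ["the library will be closed on friday"]

raw = count_ngrams(tweets, 3)         # every copy of the tweet is counted
dedup = count_ngrams(set(tweets), 3)  # exact duplicates collapsed first

trigram = ("will", "be", "closed")
print(trigram, "raw:", raw[trigram], "deduplicated:", dedup[trigram])

# One 11-word superstring contains this many n-grams with n >= 2, and each
# of them inherits the frequency of the repeated tweet.
tokens = repeated.split()
nested = sum(len(ngrams(tokens, n)) for n in range(2, len(tokens) + 1))
print("n-grams nested in the repeated tweet:", nested)
```

Collapsing exact duplicates only removes part of the inflation: near-duplicate 
tweets still share long substrings, which is the case the question is about.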

Any input much appreciated.
_______________________________________________
Corpora mailing list -- corpora@list.elra.info
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to corpora-le...@list.elra.info
