>>  For tweets, if you are interested in up to 10-grams, you could find the
11-grams, and throw away tweets that have an identical 11-gram?

I use 11-grams to eliminate duplicate texts in the 17.5+ billion word NOW
corpus <https://www.english-corpora.org/now/> from English-Corpora.org,
which grows by about 6-8 million words (10,000+ texts) each day. This is
done in SQL Server, which is the backbone
<https://www.english-corpora.org/help/architecture.pdf> for the corpora
from English-Corpora.org <http://english-corpora.org/>. All of the
processing of the texts (generating URLs, downloading texts, deleting
duplicates via 11-grams, PoS tagging, insertion into the existing corpus,
etc.) is done automatically every night by a customized pipeline that I've
created.
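
As a rough illustration of the 11-gram idea (a minimal in-memory Python
sketch, not the SQL Server implementation, whose details are not given in
this message): a text is dropped if it shares at least one 11-gram with a
text that was already kept. The function names and the tokenization
(lowercased word tokens, punctuation ignored) are assumptions made for the
example.

```python
import re

N = 11  # n-gram length used for duplicate detection


def tokenize(text):
    # Assumption: lowercase word tokens, punctuation dropped.
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n=N):
    # Set of all contiguous n-token sequences; empty if the text is shorter than n.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def drop_duplicates(texts, n=N):
    seen = set()   # n-grams of every text kept so far
    kept = []
    for text in texts:
        grams = ngrams(tokenize(text), n)
        if not seen.isdisjoint(grams):
            continue  # shares an n-gram with an earlier kept text
        seen |= grams
        kept.append(text)
    return kept
```

Note that texts shorter than 11 tokens produce no 11-grams, so this sketch
never flags them; short texts such as tweets would need a smaller n or a
separate exact-match check. For a corpus too large to hold every n-gram in
memory, the set could be replaced by a database index or a Bloom filter.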

Mark Davies


On Fri, Jun 23, 2023 at 8:55 AM Darren Cook via Corpora <
corpora@list.elra.info> wrote:

> > many repeated exact tweets, or very similar tweets, leading to long
> > super strings of often 9 or 10 or more words together.
>
> One approach that came to mind was https://arxiv.org/abs/2112.11446
> where they remove duplicate documents if the 13-gram jaccard similarity
> is over 0.8. (13-grams exclude spaces and punc.)
>
> For tweets, if you are interested in up to 10-grams, you could find the
> 11-grams, and throw away tweets that have an identical 11-gram?
>
> If data set size is the problem for discovering and removing duplicate
> tweets, look into bloom filters.
>
> For a ready-made package, https://docs.dedupe.io/en/latest/ was the one
> that came up a lot in my search just now. (I don't know how it scales,
> though.)
>
> HTH,
> Darren
> _______________________________________________
> Corpora mailing list -- corpora@list.elra.info
> https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
> To unsubscribe send an email to corpora-le...@list.elra.info
>


-- 
============================================
Mark Davies
english-corpora.org
mark-davies.org
============================================