>> For tweets, if you are interested in up to 10-grams, you could find the 11-grams, and throw away tweets that have an identical 11-gram?
I use 11-grams to eliminate duplicate texts in the 17.5+ billion word NOW corpus <https://www.english-corpora.org/now/> from English-Corpora.org, which grows by about 6-8 million words (10,000+ texts) each day. This is done in SQL Server, which is the backbone <https://www.english-corpora.org/help/architecture.pdf> for the corpora from English-Corpora.org <http://english-corpora.org/>.

All of the processing of the texts (including generating URLs, downloading texts, deletion of duplicates via 11-grams, PoS tagging, insertion into existing corpus, etc) is done automatically every night using a customized pipeline that I've created.

Mark Davies

On Fri, Jun 23, 2023 at 8:55 AM Darren Cook via Corpora <corpora@list.elra.info> wrote:

> > many repeated exact tweets, or very similar tweets, leading to long
> > super strings of often 9 or 10 or more words together.
>
> One approach that came to mind was https://arxiv.org/abs/2112.11446
> where they remove duplicate documents if the 13-gram jaccard similarity
> is over 0.8. (13-grams exclude spaces and punc.)
>
> For tweets, if you are interested in up to 10-grams, you could find the
> 11-grams, and throw away tweets that have an identical 11-gram?
>
> If data set size is the problem for discovering and removing duplicate
> tweets, look into bloom filters.
>
> For a ready-made package, https://docs.dedupe.io/en/latest/ was the one
> that came up a lot in my search just now. (I don't know how it scales,
> though.)
>
> HTH,
> Darren
> _______________________________________________
> Corpora mailing list -- corpora@list.elra.info
> https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
> To unsubscribe send an email to corpora-le...@list.elra.info
>

--
============================================
Mark Davies
english-corpora.org
mark-davies.org
============================================
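
For anyone who wants to try the 11-gram idea on a small scale, here is a minimal Python sketch. It is not the SQL Server pipeline described above, only an illustration under some assumptions: n-grams are built from lowercased word tokens with punctuation dropped, a text is discarded if it shares any 11-gram with a text already kept, and all function names (tokens, ngrams, drop_shared_ngram_duplicates, jaccard) are made up for this example. The jaccard helper illustrates the similarity-threshold variant mentioned in the quoted message, again on word n-grams rather than whatever tokenization that paper uses.

```python
import re


def tokens(text):
    """Lowercased word tokens; punctuation and spacing are ignored (an assumption)."""
    return re.findall(r"\w+", text.lower())


def ngrams(text, n=11):
    """Return the set of word n-grams in text, each as a tuple of tokens."""
    toks = tokens(text)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def drop_shared_ngram_duplicates(texts, n=11):
    """Keep a text only if it shares no n-gram with an earlier kept text.

    Texts shorter than n tokens have no n-grams and are always kept.
    """
    seen = set()   # n-grams of every text kept so far
    kept = []
    for text in texts:
        grams = ngrams(text, n)
        if not seen.isdisjoint(grams):
            continue  # at least one n-gram already seen: treat as duplicate
        seen |= grams
        kept.append(text)
    return kept


def jaccard(a, b, n=13):
    """Jaccard similarity of the word n-gram sets of two texts."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 0.0  # neither text is long enough to have an n-gram
    return len(ga & gb) / len(ga | gb)
```

The seen set holds every n-gram of every kept text, so memory grows with the collection. For large data sets, storing hashes of the n-grams, or using a Bloom filter as suggested in the quoted message, would likely be needed instead of a plain set.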