Dear David,

you may have solved this in the meantime, but if not:
- If the task is to deduplicate, have a look at Onion:
  <https://corpus.tools/wiki/Onion>
- If you only need counts, you can create a corpus in Sketch Engine and let it do the calculation. We use a suffix array to compute n-grams up to length 20 by default, following:

  Yamamoto and Church: Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus. Computational Linguistics, Volume 27, Issue 1, March 2001, pp. 1-30.
  http://www.aclweb.org/anthology/J01-1001

The interface displays n-grams up to length 6 (although they are computed up to length 20); let me know if you need longer ones displayed too.

Best regards,
Milos Jakubicek

CEO, Lexical Computing
Brno, CZ | Brighton, UK
http://www.lexicalcomputing.com
http://www.sketchengine.eu

On Tue, 27 Jun 2023 at 19:20, Christian Wartena via Corpora <corpora@list.elra.info> wrote:

> Hello,
> we have used the Apriori-Algorithm to detect long identical text passages
> (https://link.springer.com/chapter/10.1007/978-3-030-86159-9_34). That
> works quite well. I am not sure whether Frieda Jsi published the code, but
> it is quite easy to implement or I can send you the code.
>
> Best
> Christian
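P.S. In case it helps to see the suffix-array idea in miniature, here is a rough Python sketch of counting n-grams up to a maximum length. It is only an illustration under my own assumptions, not the Sketch Engine implementation and not the more efficient LCP-based procedure from Yamamoto and Church; the function name `ngram_counts` and the parameter `max_n` are made up for this example, and the sorting is deliberately naive, so it will only suit small inputs.

```python
def ngram_counts(tokens, max_n=20):
    """Count every n-gram (1 <= n <= max_n) via a naive suffix array.

    Illustrative only: sorting truncated suffixes like this is far slower
    than a real suffix-array construction.
    """
    # Suffix array: start positions, sorted by the suffix that begins there.
    # Suffixes are truncated to max_n tokens, since longer context is unused.
    suffixes = sorted(range(len(tokens)), key=lambda i: tokens[i:i + max_n])

    counts = {}
    for n in range(1, max_n + 1):
        run_key, run_len = None, 0
        # Suffixes sharing the same first n tokens are adjacent in sorted
        # order, so each distinct n-gram shows up as one contiguous run.
        for start in suffixes:
            gram = tuple(tokens[start:start + n])
            if len(gram) < n:
                continue  # this suffix is too short to contain an n-gram
            if gram == run_key:
                run_len += 1
            else:
                if run_key is not None:
                    counts[run_key] = run_len
                run_key, run_len = gram, 1
        if run_key is not None:
            counts[run_key] = run_len
    return counts
```

For example, `ngram_counts("a b a b a".split(), max_n=3)` should report the trigram `("a", "b", "a")` twice. For deduplication you would then look at the long n-grams whose count is greater than one.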
_______________________________________________
Corpora mailing list -- corpora@list.elra.info
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to corpora-le...@list.elra.info