[Corpora-List] Re: Counting multiple long (9+) n-grams in corpora: request for approaches

Miloš Jakubíček via Corpora Fri, 21 Jul 2023 12:45:25 -0700

Dear David,

you may have solved this in the meantime, but if not:


- if the task is to deduplicate, have a look at onion
<https://corpus.tools/wiki/Onion>
- if you only need counts, you can create a corpus in Sketch Engine and
let it compute them; we use a suffix array to compute n-grams up to length
20 by default, following:

Yamamoto and Church: Using Suffix Arrays to Compute Term Frequency and
Document Frequency for All Substrings in a Corpus
Computational Linguistics, Volume 27 Issue 1, March 2001, pp 1-30
http://www.aclweb.org/anthology/J01-1001
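
To illustrate the general idea of counting n-grams with a suffix array, here
is a minimal sketch in Python. It is not Sketch Engine's implementation and
it skips the LCP-based class computation from the paper: it simply sorts
token positions by the (truncated) suffix starting there, so that equal
n-grams end up next to each other and can be counted in one pass. The
function names and the min_freq parameter are made up for this example.

```python
from itertools import groupby


def build_suffix_array(tokens, max_n=20):
    """Return token positions sorted by the suffix starting at each one.

    Suffixes are truncated to max_n tokens, which is enough to count
    n-grams up to that length. This naive sort is for illustration only.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:i + max_n])


def count_ngrams(tokens, n, suffix_array=None, min_freq=2):
    """Count n-grams of length n that occur at least min_freq times."""
    if suffix_array is None:
        suffix_array = build_suffix_array(tokens, max_n=n)
    # Keep only positions where a full n-gram fits; the sort order
    # guarantees that identical n-grams are adjacent.
    starts = (i for i in suffix_array if i + n <= len(tokens))
    counts = {}
    for gram, group in groupby(starts, key=lambda i: tuple(tokens[i:i + n])):
        freq = sum(1 for _ in group)
        if freq >= min_freq:
            counts[gram] = freq
    return counts


if __name__ == "__main__":
    text = "a b c a b c a b".split()
    sa = build_suffix_array(text, max_n=20)
    # The same suffix array can be reused for every n up to max_n.
    for n in (2, 3):
        print(n, count_ngrams(text, n, suffix_array=sa))
```

Building the array once and reusing it for every length is what makes this
attractive when many long n-grams have to be counted in the same corpus.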

The interface displays n-grams only up to length 6 (although they are
computed up to length 20); let me know if you need longer ones displayed too.

Best regards,
Milos Jakubicek

CEO, Lexical Computing
Brno, CZ | Brighton, UK
http://www.lexicalcomputing.com
http://www.sketchengine.eu


On Tue, 27 Jun 2023 at 19:20, Christian Wartena via Corpora <
corpora@list.elra.info> wrote:

> Hello,
> we have used the Apriori algorithm to detect long identical text passages (
> https://link.springer.com/chapter/10.1007/978-3-030-86159-9_34). That
> works quite well. I am not sure whether Frieda Jsi published the code, but
> it is quite easy to implement or I can send you the code.
>
> Best
> Christian
> _______________________________________________
> Corpora mailing list -- corpora@list.elra.info
> https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
> To unsubscribe send an email to corpora-le...@list.elra.info
>
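
For readers unfamiliar with the Apriori idea mentioned in the quoted
message, here is a minimal sketch in Python of how it can be applied to
n-grams. It is an assumption about the general technique, not the code from
the cited chapter: an n-gram can only be frequent if both its (n-1)-token
prefix and suffix are frequent, so candidates are grown one token at a time
and everything else is pruned. The function name and parameters are
hypothetical.

```python
from collections import Counter


def frequent_ngrams(tokens, max_n, min_freq=2):
    """Return {n: {ngram: count}} for n-grams occurring >= min_freq times.

    Level-wise (Apriori-style) search: an n-gram is counted only if its
    prefix and suffix of length n-1 were frequent at the previous level.
    """
    frequent = {}
    prev = None  # frequent (n-1)-grams from the previous level
    for n in range(1, max_n + 1):
        counts = Counter()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            # Prune candidates whose sub-n-grams are not frequent.
            if prev is not None and (gram[:-1] not in prev or gram[1:] not in prev):
                continue
            counts[gram] += 1
        current = {g: c for g, c in counts.items() if c >= min_freq}
        if not current:
            break  # no longer n-gram can be frequent either
        frequent[n] = current
        prev = current
    return frequent


if __name__ == "__main__":
    text = "x a b c d y a b c d z".split()
    result = frequent_ngrams(text, max_n=10)
    longest = max(result)
    print(longest, result[longest])
```

The pruning keeps the number of candidates small at each level, which is
why long repeated passages can be found without counting every possible
long n-gram.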

