Graydon, spread the word!
On Thursday, 12 November 2020, 13:59:12 CET, Graydon wrote:
On Thu, Nov 12, 2020 at 11:58:29AM +0100, Christian Grün scripsit:
> Gerrit has already mentioned fingerprinting techniques. If your time
> is limited, it may be sufficient to ap
On Thu, Nov 12, 2020 at 11:58:29AM +0100, Christian Grün scripsit:
> Gerrit has already mentioned fingerprinting techniques. If your time
> is limited, it may be sufficient to apply full-text tokenization and
> Soundex to your strings:
>
> let $get-fuzzy-match-value := function($x) {
>   $x
>   =>
On Thu, Nov 12, 2020 at 09:30:47AM +0100, Victor / tokiop scripsit:
> Hello Graydon,
>
> These blog posts discuss various algorithms to find near-duplicate documents,
> performance, and XQuery (MarkLogic dialect) implementations:
>
> https://stuartmyles.blogspot.com/2012/10/longest-common-substr
On Thu, Nov 12, 2020 at 01:21:56AM +0100, Imsieke, Gerrit, le-tex scripsit:
> Maybe OpenRefine and particularly its clustering feature [1] can be useful.
> I don't have any first-hand experience with it though.
>
> [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
I shall be c
Hi Graydon,
Gerrit has already mentioned fingerprinting techniques. If your time
is limited, it may be sufficient to apply full-text tokenization and
Soundex to your strings:
let $get-fuzzy-match-value := function($x) {
  $x
  => ft:tokenize(map { 'stemming': true() })
  => distinct-values()
  =>
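
The function above is cut off in this listing, so what follows is not a reconstruction of it. It is only a minimal sketch of the general idea described in the message (tokenize, reduce each token to a Soundex code, then compare the resulting keys). It assumes BaseX with its Full-Text Module (ft:tokenize, as used above) and its Strings Module (strings:soundex, as in BaseX 9.x; the prefix may differ in other releases). The name local:fuzzy-key and the sample strings are hypothetical.

```xquery
(: Hypothetical helper, not the function from the message above.
   Assumes BaseX: ft:tokenize (Full-Text Module) and
   strings:soundex (Strings Module, BaseX 9.x naming). :)
declare function local:fuzzy-key($s as xs:string) as xs:string {
  (: split into stemmed full-text tokens :)
  let $tokens := ft:tokenize($s, map { 'stemming': true() })
  (: replace each distinct token by its Soundex code :)
  let $codes :=
    for $t in distinct-values($tokens)
    return strings:soundex($t)
  (: sort so that word order does not affect the key :)
  return string-join(sort(distinct-values($codes)), ' ')
};

(: Group invented sample strings by their key. :)
for $s in ('Robert Smith', 'Smith, Rupert', 'Alice Jones')
group by $key := local:fuzzy-key($s)
return <cluster key="{ $key }">{ $s ! <string>{ . }</string> }</cluster>
```

Strings that land in the same cluster are only candidates for being near-duplicates. Soundex is coarse, so a second, stricter comparison on each cluster is probably needed before treating two strings as the same.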
Hello Graydon,
These blog posts discuss various algorithms to find near-duplicate documents,
performance, and XQuery (MarkLogic dialect) implementations:
https://stuartmyles.blogspot.com/2012/10/longest-common-substring-in-xquery-part_9.html
https://stuartmyles.blogspot.com/2012/10/longest-commo