Re: [basex-talk] grouping by fuzzy match?

2020-11-12 Thread Hans-Juergen Rennau
Graydon, spread the word!
On Thursday, 12 November 2020, 13:59:12 CET, Graydon wrote:
On Thu, Nov 12, 2020 at 11:58:29AM +0100, Christian Grün scripsit:
> Gerrit has already mentioned fingerprinting techniques. If your time
> is limited, it may be sufficient to ap

Re: [basex-talk] grouping by fuzzy match?

2020-11-12 Thread Graydon
On Thu, Nov 12, 2020 at 11:58:29AM +0100, Christian Grün scripsit:
> Gerrit has already mentioned fingerprinting techniques. If your time
> is limited, it may be sufficient to apply full-text tokenization and
> Soundex to your strings:
>
> let $get-fuzzy-match-value := function($x) {
> $x
> =>

Re: [basex-talk] grouping by fuzzy match?

2020-11-12 Thread Graydon
On Thu, Nov 12, 2020 at 09:30:47AM +0100, Victor / tokiop scripsit:
> Hello Graydon,
>
> These blogposts discuss various algorithms to find near-duplicate documents,
> performance, and xquery (marklogic dialect) implementations :
>
> https://stuartmyles.blogspot.com/2012/10/longest-common-substr

Re: [basex-talk] grouping by fuzzy match?

2020-11-12 Thread Graydon
On Thu, Nov 12, 2020 at 01:21:56AM +0100, Imsieke, Gerrit, le-tex scripsit:
> Maybe OpenRefine and particularly its clustering feature [1] can be useful.
> I don't have any first-hand experience with it though.
>
> [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
I shall be c

Re: [basex-talk] grouping by fuzzy match?

2020-11-12 Thread Christian Grün
Hi Graydon,
Gerrit has already mentioned fingerprinting techniques. If your time is limited, it may be sufficient to apply full-text tokenization and Soundex to your strings:
let $get-fuzzy-match-value := function($x) {
  $x
  => ft:tokenize(map { 'stemming': true() })
  => distinct-values()
  =>
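The tokenize-then-Soundex idea described in this message can be sketched outside XQuery as well. The following is a minimal Python illustration, not the code from the thread (which is cut off in this listing). The names `soundex`, `fuzzy_key` and `group_by_fuzzy_key` are hypothetical, stemming is left out, and sorting the codes so that word order does not matter is an assumption of this sketch.

```python
import re
from collections import defaultdict

# Classic American Soundex digit table.
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the four-character Soundex code of an ASCII word."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code is kept
    return (result + "000")[:4]


def fuzzy_key(text):
    """Tokenize, drop duplicate tokens, Soundex each one, sort, and join.

    Assumption: tokens are runs of ASCII letters, so accented characters
    split a word; stemming is not applied in this sketch.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(sorted({soundex(t) for t in tokens}))


def group_by_fuzzy_key(strings):
    """Group strings that share the same fuzzy key."""
    groups = defaultdict(list)
    for s in strings:
        groups[fuzzy_key(s)].append(s)
    return dict(groups)
```

With this sketch, `group_by_fuzzy_key(["Robert Smith", "Smyth, Rupert", "Jane Doe"])` puts the first two strings in one group, because both reduce to the same set of Soundex codes. Soundex is tuned for English surnames, so a different phonetic key may work better on other data.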

Re: [basex-talk] grouping by fuzzy match?

2020-11-12 Thread Victor / tokiop
Hello Graydon,
These blog posts discuss various algorithms for finding near-duplicate documents, their performance, and XQuery (MarkLogic dialect) implementations:
https://stuartmyles.blogspot.com/2012/10/longest-common-substring-in-xquery-part_9.html
https://stuartmyles.blogspot.com/2012/10/longest-commo