Maybe OpenRefine, and particularly its clustering feature [1], can be useful. I don't have any first-hand experience with it though.

[1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
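
For what it's worth, here is a minimal sketch of the key-collision idea, loosely in the spirit of the "fingerprint" method described on that page. It is written in Python rather than XQuery, and the names fingerprint and group_by_fingerprint are hypothetical, standing in for the get-fuzzy-match-value() function asked about in the original question. The normalization steps (lowercasing, dropping accents and punctuation, sorting unique tokens) are my assumption of a reasonable key, not OpenRefine's exact implementation.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Return a key that ignores case, accents, punctuation and word order."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, then remove anything that is not a word character or whitespace.
    cleaned = re.sub(r"[^\w\s]", "", stripped.lower())
    # Unique tokens, sorted, so word order and repetition do not matter.
    return " ".join(sorted(set(cleaned.split())))


def group_by_fingerprint(paragraphs):
    """Group strings whose fingerprints collide (hypothetical helper)."""
    groups = defaultdict(list)
    for paragraph in paragraphs:
        groups[fingerprint(paragraph)].append(paragraph)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "The quick, brown fox!",
        "fox the  QUICK brown",
        "A different paragraph.",
    ]
    for key, members in group_by_fingerprint(sample).items():
        print(repr(key), members)
```

Because the key is a property of each string on its own, it can be used directly in a "group by". The trade-off is that it only catches differences in case, accents, punctuation and word order; it will not group paragraphs that differ by a typo or an extra word, which is where pairwise measures such as Levenshtein distance come in.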

On 12.11.2020 00:57, Graydon Saunders wrote:
Useful keywords; thank you!

Also more of a development effort than this project will support, alas. (Unless someone's willing to provide a pointer to their public release of such a solution, free for commercial use?  Which doesn't seem a whole lot more likely than someone throwing a gold brick through my window.)

On Wed, Nov 11, 2020 at 6:42 PM Imsieke, Gerrit, le-tex wrote:

    This is probably difficult since in BaseX, fuzzy matching is
    implemented using the Levenshtein distance between two strings [1].
    Therefore similarity is a relation between pairs of paragraphs rather
    than an intrinsic property of an individual paragraph.

    You should look for content fingerprinting/clustering techniques.

    [1] https://docs.basex.org/wiki/Full-Text#Fuzzy_Querying


    On 12.11.2020 00:00, Graydon Saunders wrote:
     > Hello --
     >
     > Is there some way to assign the abstraction of a fuzzy match to a
     > variable, so that something like
     >
     > for $x in //p
     >    let $key := get-fuzzy-match-value($x)
     >    group by $key
     >    return <similar-paragraphs>{$x}</similar-paragraphs>
     >
     > would be possible?
     >
     > I'm supposing this is one of those things that's either easy or
     > impossible.
     >
     > Thanks!
     > Graydon

