> Would anyone know of any prior art for detection of "short edit distances"?
> (Perhaps even already on CPAN?)
As David & Zefram pointed out, Levenshtein is the classic algorithm for this,
but there are plenty of others; in the SEE ALSO for Text::Levenshtein I’ve
listed at least some of the ones I know of on CPAN:
https://metacpan.org/pod/Text::Levenshtein#SEE-ALSO
A better algorithm for this purpose is the Damerau-Levenshtein edit distance:
Classic Levenshtein counts the number of insertions, deletions, and
substitutions needed to get from one string to the other. Comparing
"Algorithm::SVM" and "Algorithm::VSM" gives an edit distance of 2.
The Damerau variant adds transpositions of adjacent characters. This results in
an edit distance of 1 for the example above, which is how my script found it.
I used Text::Levenshtein::Damerau::XS, because it’s quicker. That’s how I found
the examples I gave yesterday.
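To make the difference between the two measures concrete, here is a minimal sketch in Python. It is not the code from my script and not the implementation inside Text::Levenshtein::Damerau::XS; the function names levenshtein and damerau_osa are made up for illustration. The second function implements the restricted "optimal string alignment" form of Damerau-Levenshtein, which may give a different answer from the CPAN module on some inputs, though not on the example above.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic Levenshtein: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def damerau_osa(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment).

    Same as Levenshtein, plus a cost-1 swap of two adjacent characters.
    Assumption: this is an illustrative variant, not the XS module's code.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Adjacent transposition: "SV" in a lines up with "VS" in b.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(levenshtein("Algorithm::SVM", "Algorithm::VSM"))  # prints 2
print(damerau_osa("Algorithm::SVM", "Algorithm::VSM"))  # prints 1
```

The swapped "SV"/"VS" costs two substitutions under classic Levenshtein but a single transposition under the Damerau variant, which is why the latter is better at catching typo-style name collisions.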
I’ll tweak my script to skip packages in the same distribution (e.g.
Acme::Flat::GV and Acme::Flat::HV). Then I just need to get a list of new
packages each day, and I’m just about there :-)
Neil