I've updated my dump-processing Python project to include code for quickly
detecting identity reverts from XML dumps. See
https://bitbucket.org/halfak/wikimedia-utilities for the project and the
process() function at the bottom of
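
For readers unfamiliar with the idea, here is a minimal sketch of identity
revert detection. It is not the process() function from the project above; the
function name detect_identity_reverts, the (rev_id, text) input shape, and the
window of 15 revisions are all assumptions made for illustration. The idea is
that a revision whose text checksum matches a recent earlier revision has
restored that revision, so everything in between was reverted.

```python
import hashlib
from collections import deque


def detect_identity_reverts(revisions, window=15):
    """Return a list of (reverting_id, reverted_to_id, reverted_ids).

    `revisions` is an iterable of (rev_id, text) pairs for one page, in
    chronological order. `window` (an assumed value) limits how far back
    we look for an identical earlier revision.
    """
    recent = deque(maxlen=window)  # (rev_id, checksum), oldest first
    reverts = []
    for rev_id, text in revisions:
        checksum = hashlib.sha1(text.encode("utf-8")).hexdigest()
        history = list(recent)
        # Search from newest to oldest for an identical earlier revision.
        for i in range(len(history) - 1, -1, -1):
            if history[i][1] == checksum:
                reverted = [r for r, _ in history[i + 1:]]
                # If nothing lies in between, this is a null edit, not a revert.
                if reverted:
                    reverts.append((rev_id, history[i][0], reverted))
                break
        recent.append((rev_id, checksum))
    return reverts


if __name__ == "__main__":
    page = [(1, "a"), (2, "b"), (3, "a"), (4, "c")]
    print(detect_identity_reverts(page))
```

Hashing the text keeps the comparison cheap: each revision is hashed once and
compared against at most `window` stored checksums, rather than comparing full
texts pairwise.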
It's worth pointing out that in our research at PARC, we also discussed
the possibility of using a containment-based measure, as described in:
On the resemblance and containment of documents, A. Z. Broder
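
As a rough illustration of the measures in that paper, the sketch below
computes containment and resemblance over word shingles. The function names,
the whitespace tokenization, and the default shingle width of 4 are
assumptions for this example, and it uses exact shingle sets rather than the
sampled sketches the paper describes for large collections.

```python
def shingles(text, w=4):
    """Return the set of contiguous w-word shingles in `text`."""
    tokens = text.split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def containment(a, b, w=4):
    """Fraction of a's shingles that also appear in b: |S(a) & S(b)| / |S(a)|."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa:
        return 0.0  # assumed convention for empty input
    return len(sa & sb) / len(sa)


def resemblance(a, b, w=4):
    """Jaccard similarity of the shingle sets: |S(a) & S(b)| / |S(a) | S(b)|."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    if not union:
        return 0.0  # assumed convention for two empty inputs
    return len(sa & sb) / len(union)


if __name__ == "__main__":
    old = "the quick brown fox"
    new = "the quick brown fox jumps over the lazy dog"
    print(containment(old, new, w=2))   # how much of `old` survives in `new`
    print(resemblance(old, new, w=2))   # symmetric similarity
```

Containment is asymmetric, which is what makes it interesting for reverts: a
revision can fully contain an earlier one without being identical to it.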
In the end, we realized that the real issue is that there is no
universal agreement on what is a