On Wed, 2 Mar 2022, @lbutlr wrote:

I'm mulling over writing some code to find emails in a maildir that are
duplicates, ish.  That is to say, sometimes the same message doesn't
quite show up as an exact match.  For example, an ad company may send
you three identical messages, except they aren't actually EXACTLY
identical: the Message-IDs are different, and maybe the quoted part of
the To address is different, so normal duplicate finders fail to find
them.

Before I start, is this a solved problem?

Not perfectly, and maybe impossible in the general sense.

If you've ever had to anonymize mail by comparing samples sent by a
mailing list provider to 2 different recipients, you will have seen the
various hashes and identifiers that show up in tracking headers and URLs.
Add customized name labels (e.g. "Dear Alfred P. Sloan") or other
individual-specific information, and this becomes a complex question of
how different is different.
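
One rough way to put a number on "how different is different" is a
similarity ratio between two message bodies.  The following is only a
sketch using Python's standard difflib; the function name similarity and
the sample strings are made up for illustration, and any cutoff you pick
for "close enough" is a judgment call, not something this code decides
for you.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalized texts are identical."""
    # Collapse runs of whitespace so re-wrapped lines do not count as changes.
    a_norm = " ".join(a.split())
    b_norm = " ".join(b.split())
    # autojunk=False disables difflib's "popular element" heuristic,
    # which can distort results on long texts.
    return SequenceMatcher(None, a_norm, b_norm, autojunk=False).ratio()


# Hypothetical bodies: two personalized copies of one ad, and an unrelated note.
first = "Dear Alfred P. Sloan, our spring sale starts today."
second = "Dear Jane Q. Public, our spring sale starts today."
other = "Meeting moved to 3pm Friday, room 204."

print(similarity(first, second))  # high: only the name differs
print(similarity(first, other))   # noticeably lower
```

Comparing every pair of messages this way is slow on a large maildir,
so it is probably only practical after the candidates have been
narrowed down by cheaper checks such as sender and subject.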

If you make some simplifying assumptions (e.g. exact same message body,
same headers for From, sending network or IP, time range, and Subject),
you can do a fairly good job.
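
As a minimal sketch of those simplifying assumptions, the Python below
groups messages that share the same From address, the same Subject and
the same whitespace-normalized body text, and then keeps only those
whose Date headers fall within a time window.  The function names and
the one-hour default window are my own choices for illustration, and
the sending network/IP check is left out because it would need
site-specific parsing of Received headers.

```python
import hashlib
from collections import defaultdict
from datetime import timedelta, timezone
from email.utils import parseaddr, parsedate_to_datetime


def body_text(msg):
    """Concatenate all text/* parts of a message, with whitespace collapsed."""
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True) or b""
        try:
            chunks.append(payload.decode(part.get_content_charset() or "utf-8", "replace"))
        except LookupError:  # unknown charset name in the message
            chunks.append(payload.decode("utf-8", "replace"))
    return " ".join(" ".join(chunks).split())


def dedup_key(msg):
    """Key on sender address, Subject and a hash of the body; ignores Message-ID and To."""
    sender = parseaddr(str(msg.get("From", "")))[1].lower()
    subject = " ".join(str(msg.get("Subject", "")).split())
    digest = hashlib.sha256(body_text(msg).encode("utf-8")).hexdigest()
    return (sender, subject, digest)


def find_near_duplicates(messages, window=timedelta(hours=1)):
    """Take (identifier, message) pairs; return lists of identifiers that look like duplicates."""
    groups = defaultdict(list)
    for ident, msg in messages:
        try:
            when = parsedate_to_datetime(str(msg.get("Date", "")))
        except (TypeError, ValueError):
            continue  # skip messages with a missing or unparseable Date
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)  # assumption: treat naive dates as UTC
        groups[dedup_key(msg)].append((when, ident))

    clusters = []
    for items in groups.values():
        items.sort(key=lambda pair: pair[0])
        current = [items[0]]
        for when, ident in items[1:]:
            if when - current[0][0] <= window:
                current.append((when, ident))
            else:
                if len(current) > 1:
                    clusters.append([i for _, i in current])
                current = [(when, ident)]
        if len(current) > 1:
            clusters.append([i for _, i in current])
    return clusters
```

To run this over a real maildir, something like
find_near_duplicates(mailbox.Maildir(path, create=False).iteritems())
with the standard mailbox module should work, since it yields
(key, message) pairs.  Note that this only catches copies whose bodies
are byte-for-byte the same after whitespace normalization; per-recipient
tracking URLs or name labels in the body will defeat it.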

Joseph Tam <jtam.h...@gmail.com>
