On 02/09/18 13:17, Eric Wong wrote:
> In addition to the git object_id (blob SHA-1) and Message-Id
> header; it seems necessary to introduce an in-between identifier
> for deduplicating which isn't as loose as Message-Id or as
> strict as object_id: content_id
>
> I think a hash of the following raw headers + raw body will
> suffice:
>
>	Subject, From, Date, Message-Id, References, To, Cc,
>	In-Reply-To, MIME-Version, Content-Type,
>	Content-Disposition, Content-Transfer-Encoding

That's similar to what ARC/DKIM do. E.g. in my mailbox the message has
the following headers sealed:

  h=archived-at:list-post:list-owner:list-subscribe:list-unsubscribe
    :list-help:list-archive:precedence:list-id:content-disposition
    :mime-version:message-id:subject:to:from:date
    :arc-authentication-results;

If we trim the arc- and list-specific headers, that's:

  precedence
  content-disposition
  mime-version
  message-id
  subject
  to
  from
  date

I'm not sure we should care about content-transfer-encoding, because
that can be mangled by intermediate MTAs (at least that used to happen
all the time in the past -- not sure if it's still the case).

> List-Id, X-Mailing-List should be left out so different
> readers/lists can share spam removals in cross posts.

Note, that mailing lists that modify the Subject header (e.g. to add
[mailinglist] identifier) will also be impacted similarly.

> (*) I noticed the first Received: header (last hop) is missing
> from the cregit sources; but the first remaining Received:
> header also includes the identity of the recipient in more
> recent mails...

I specifically sanitized all Received: headers that didn't say "by
vger.kernel.org" because these are donated by individual users and I
didn't want to expose their potentially private info.

Best,
-- 
Konstantin Ryabitsev
Director, IT Infrastructure Security
The Linux Foundation
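
For illustration only, the content_id idea discussed above could be
sketched roughly as follows. The header list is the one from the quoted
proposal; everything else is an assumption of this sketch and not
something either message specifies: the use of SHA-256, the function
name content_id(), the "name:value" framing fed to the hash, and
normalizing CRLF to LF before splitting off the body. It uses Python's
standard email and hashlib modules.

```python
import email
import hashlib

# Header set from the quoted proposal. List-Id, X-Mailing-List and
# Received are deliberately absent, so copies of a cross-post that
# arrive via different lists still hash the same.
CONTENT_ID_HEADERS = (
    "Subject", "From", "Date", "Message-Id", "References", "To", "Cc",
    "In-Reply-To", "MIME-Version", "Content-Type",
    "Content-Disposition", "Content-Transfer-Encoding",
)


def content_id(raw: bytes, headers=CONTENT_ID_HEADERS) -> str:
    """Return a hex digest over the selected headers plus the body.

    Hypothetical helper: SHA-256 and the framing are assumptions.
    """
    # The default (compat32) policy leaves header values undecoded.
    msg = email.message_from_bytes(raw)
    digest = hashlib.sha256()
    for name in headers:
        # A header may appear more than once; hash every occurrence.
        for value in msg.get_all(name, []):
            line = f"{name.lower()}:{value}\n"
            # Simplification: raw non-ASCII header bytes are not
            # preserved exactly by this encoding step.
            digest.update(line.encode("utf-8", "replace"))
    # Body = everything after the first blank line, taken from the
    # original bytes (after CRLF -> LF normalization).
    _, _, body = raw.replace(b"\r\n", b"\n").partition(b"\n\n")
    digest.update(b"\n" + body)
    return digest.hexdigest()
```

With this sketch, adding List-Id or Received headers leaves the digest
unchanged, while a list that rewrites Subject (e.g. prepending
[mailinglist]) produces a different digest, which is the impact noted
above. Dropping Content-Transfer-Encoding, as suggested in the reply,
would be a one-line change to the tuple.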