Re: Better Id generator

sebb Sat, 10 Dec 2016 03:47:22 -0800

Another issue I have noticed - From_ mangling.

Exported mbox files must use From_ mangling to be usable as input.
However the commonest method (mboxo) only mangles a plain From_, so
one cannot then distinguish From_ from >From_.
[It looks like the ASF mod_mbox archives use mboxo]


This means that the live message as received by the archiver may not
be the same as the unmangled mbox entry.
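
To illustrate the ambiguity, here is a minimal Python sketch of mboxo-style
mangling and unmangling. The helper names are mine (hypothetical); this is
not the code used by mod_mbox or by the Pony Mail importer.

```python
def mboxo_mangle(body: str) -> str:
    """mboxo rule: prefix '>' only to lines that start with a plain 'From '."""
    return "\n".join(
        ">" + line if line.startswith("From ") else line
        for line in body.split("\n")
    )


def mboxo_unmangle(body: str) -> str:
    """Naive reverse: strip one '>' from every line starting with '>From '."""
    return "\n".join(
        line[1:] if line.startswith(">From ") else line
        for line in body.split("\n")
    )


# A live message whose body has both a plain 'From ' line and a line
# that already started with '>From ' before any mangling took place.
original = "From here on it works.\n>From the previous mail"

stored = mboxo_mangle(original)     # what ends up in the mbox file
restored = mboxo_unmangle(stored)   # what a fixed importer would see

# Both lines look identical in the mbox, so the round trip is lossy.
print(restored == original)
```

After mangling, both lines start with >From_, so the unmangler cannot tell
which one was changed and the restored body differs from the live one.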

The current importer does not unmangle >From_ lines, which means that
re-importation of a live message will change the MID (and the
Permalink).
Likewise, once the importer is fixed to do the unmangling,
re-importation of previously imported messages will change the MID
(and the Permalink).
There will still be the issue of >From_ lines in original messages,
but that should hopefully only affect a few messages (quoted lines
normally have a space after the >).
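
A rough Python sketch of why the MID changes. It assumes an id of the form
sha224@lid computed over the body only; the function name and the list id
are made up for illustration, and the real generators hash more fields
than this.

```python
import hashlib


def sketch_mid(body: str, lid: str) -> str:
    """Hypothetical id: SHA-224 hex digest of the body, then '@', then the list id."""
    digest = hashlib.sha224(body.encode("utf-8")).hexdigest()
    return f"{digest}@{lid}"


lid = "<dev.example.org>"  # assumed list id, not a real list

# Body as received live by the archiver.
live = "From my point of view this is fine."
# Same body as read back from an mboxo export without unmangling.
as_exported = ">From my point of view this is fine."

mid_live = sketch_mid(live, lid)
mid_reimported = sketch_mid(as_exported, lid)

# A single extra '>' is enough to produce a different id.
print(mid_live == mid_reimported)
```

Any generator that includes the body in the hash has the same problem,
whichever direction the mangling difference goes.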


On 8 December 2016 at 13:25, sebb <[email protected]> wrote:
> I have just discovered that there are some duplicate ezmlm message
> index numbers in the [email protected] archives.
> This means we cannot rely on the number being unique, though the From
> line is almost certainly unique if the date is included.
> But unless the same data is present in all copies of the message,
> hashes derived from it won't be stable.
>
> 200103.mbox:From [email protected] Fri Mar 09 20:26:49 2001
> 200103.mbox:From [email protected] Mon Mar 12 12:19:04 2001
> 200103.mbox:From [email protected] Mon Mar 12 19:09:39 2001
> 200103.mbox:From [email protected] Wed Mar 28 17:46:40 2001
> 200104.mbox:From [email protected] Mon Apr 09 19:25:39 2001
> 200105.mbox:From [email protected] Tue May 01 21:14:21 2001
> 200105.mbox:From [email protected] Tue May 01 21:59:08 2001
> 200105.mbox:From [email protected] Sat May 12 22:11:46 2001
> 200105.mbox:From [email protected] Mon May 21 22:51:04 2001
> 200105.mbox:From [email protected] Tue May 22 17:12:40 2001
> 200105.mbox:From [email protected] Thu May 31 06:14:01 2001
> 200106.mbox:From [email protected] Fri Jun 01 14:31:52 2001
> 200107.mbox:From [email protected] Tue Jul 17 18:56:30 2001
> 200110.mbox:From [email protected] Sun Oct 14 03:23:08 2001
> 200111.mbox:From [email protected] Fri Nov 16 00:13:49 2001
>
> 200210.mbox:From [email protected] Thu Oct 03 19:20:20 2002
> 200210.mbox:From [email protected] Fri Oct 04 19:01:01 2002
> 200211.mbox:From [email protected] Tue Nov 05 20:29:51 2002
> 200211.mbox:From [email protected] Wed Nov 27 19:39:07 2002
> 200301.mbox:From [email protected] Mon Jan 20 23:34:23 2003
> 200301.mbox:From [email protected] Sun Jan 26 08:21:21 2003
> 200302.mbox:From [email protected] Mon Feb 24 16:35:58 2003
> 200303.mbox:From [email protected] Mon Mar 03 15:44:33 2003
> 200303.mbox:From [email protected] Mon Mar 17 15:55:44 2003
> 200305.mbox:From [email protected] Wed May 28 16:40:34 2003
> 200307.mbox:From [email protected] Wed Jul 09 12:06:06 2003
> 200307.mbox:From [email protected] Fri Jul 18 13:48:09 2003
> 200308.mbox:From [email protected] Tue Aug 05 16:10:28 2003
> 200308.mbox:From [email protected] Wed Aug 13 10:26:40 2003
> 200308.mbox:From [email protected] Fri Aug 15 10:21:03 2003
>
> I don't know what happened to create duplicates partway through the sequence.
> The sequence proper started here:
>
> 200201.mbox:From [email protected] Fri Jan 11 20:30:21 2002
>
> Prior to Jan 2002 AFAICT there are only numbers 32-46 as shown above
> in the 2001 archives.
>
>
> On 16 November 2016 at 21:50, sebb <[email protected]> wrote:
>> Just discovered an issue which affects the Permalinks.
>>
>> If the archiver msgbody() function fails to detect a text message
>> body, it will return null.
>> If the message has an attachment, then the archiver will generate an
>> entry for it.
>> However the short and medium id generators will fail when accessing the body.
>>
>> The mid will then revert to the previously calculated value.
>> This is derived from the list id and the 'archived-at' header.
>> However the archived-at header is added by the archiver itself if necessary.
>> So if the message ever needs to be reloaded from elsewhere, a new
>> mid/Permalink value will likely be generated.
>> The Permalink will stop working unless the original entry is kept.
>>
>> Unfortunately the fallback mids have exactly the same format as the
>> medium generator (sha224@lid).
>> This makes it impossible to determine which Permalinks are affected
>> from the link alone.
>> However any existing mbox entries which have body: null will be using
>> the fallback mid.
>>
>>
>> On 25 October 2016 at 19:50, sebb <[email protected]> wrote:
>>> Unfortunately it appears that Message-Ids are not always generated
>>> (*), so there needs to be an equivalent that can be used.
>>> Furthermore, some emails may have more than one Message-Id [+]
>>>
>>> The original message fields (including full body) are not enough to
>>> uniquely id a message in the mailing list stream.
>>> For that one needs something like a sequence number.
>>> However some early mailboxes don't include the ezmlm sequence number.
>>>
>>> However AFAIK every message that reaches a mailing list must have been
>>> delivered to the mailserver mailbox.
>>> So the mailing list message *instance* should be identifiable using
>>> either the sequence number, or failing that, some or all of the
>>> routing information by which it arrived at the mailbox.
>>>
>>> And one can identify the mail *content* using the Message-Id if
>>> present (failing that, a hash of the full e-mail).
>>>
>>> This needs some more work and some worked example messages.
>>>
>>> Sorry to keep going on about this, but AFAICT currently Pony Mail does
>>> not have an algorithm that satisfies all the requirements for use as
>>> an ES id.
>>>
>>> (*) Automated services and older e-mails don't always have them.
>>> Such mails don't appear in mod_mbox listings though they are present
>>> in the raw mbox files.
>>> [+] mod_mbox appears to use the first Message-Id it finds.
>>>
>>> On 22 October 2016 at 23:02, sebb <[email protected]> wrote:
>>>> On 18 October 2016 at 10:49, Daniel Gruno <[email protected]> wrote:
>>>>> On 10/18/2016 11:43 AM, sebb wrote:
>>>>>> It just occurs to me that there is another aspect to be considered:
>>>>>> list renaming.
>>>>>> This might affect both the unique id and Permalinks.
>>>>>>
>>>>>> If the list id is embedded in the Permalink hash, I think it would be
>>>>>> very difficult to honour existing links.
>>>>>>
>>>>>> This is in contrast with the link format as used by mod_mbox, where
>>>>>> the list id is exposed in the URL, for example:
>>>>>>
>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-any23-dev/201209.mbox/<message-id>
>>>>>> http://mail-archives.apache.org/mod_mbox/any23-dev/201209.mbox/<message-id>
>>>>>>
>>>>>> It would be easy enough to redirect one URL to the other - if necessary.
>>>>>> (In this case, the original lists have been retained)
>>>>>
>>>>> Generally not an issue.
>>>>> When you rename a list in Pony Mail, you keep the original ID and
>>>>> source; you only touch the list value in Elasticsearch. Permalinks will
>>>>> remain the same, but point to the new list name (and keep the original
>>>>> in the source).
>>>>
>>>> I see, that's probably OK.
>>>>
>>>>> There are of course edge cases where you have to reimport and you force
>>>>> a list ID on the command line, which may change the generated ID.
>>>>
>>>> That could be a problem both for the ES id and the Permalink.
>>>> If the id uses the new LID, it will result in a new MID.
>>>> If the old message is still present, there will be two copies.
>>>> If the old message is deleted, the Permalink will break.
>>>>
>>>> I think we do need to allow for re-imports.
>>>> This makes generation of the MID tricky.
>>>>
>>>>> If we scrap the list id for a moment, we are left with, at least, the
>>>>> following constants:
>>>>>
>>>>> Sender
>>>>> Date
>>>>> Subject
>>>>> Body
>>>>
>>>> (I assume this includes the attachments)
>>>>
>>>>> Message ID
>>>>>
>>>>> The problem with the above is, that an email with the same values in all
>>>>> fields may have been sent to multiple lists, so ideally we'd keep
>>>>> separate copies for each list....which gets tricky without adding the
>>>>> list ID to this. A workaround would be to force the ID generator to
>>>>> always use the original list ID in the source for generating the ID, and
>>>>> only use the forced list ID if no list was found.
>>>>
>>>> The identical message may be sent to the same list twice.
>>>> I've seen this already on some of the lists (I think I mentioned this 
>>>> already).
>>>> It looks like a mail client sometimes duplicates the message.
>>>> Maybe it can also occur if the destination list has an alias and the
>>>> message is sent to the alias as well.
>>>>
>>>> So the list id is not sufficient to identify the specific message.
>>>>
>>>> However mailing list software has to identify bounces.
>>>> This is usually done by adding a unique id to the Return-Path.
>>>> In the case of ezmlm this is a sequence number.
>>>>
>>>> That will be present in mails sent to subscribers, and is present in
>>>> the mbox files (apart from some very early ones).
>>>>
>>>> If present, it should be a constant for a particular message on a
>>>> specific mailing list.
>>>>
>>>>> Thoughts?
>>>>
>>>> The MID needs to be unique to ensure that all messages can be stored OK in 
>>>> ES.
>>>> Unwanted duplicates cause message loss.
>>>>
>>>> The Permalink needs to be persistent.
>>>> If there are multiple Permalinks for the same message that is not a 
>>>> problem.
>>>> If a single Permalink applies to multiple messages, that is not ideal,
>>>> but at least there is a chance of recovering the message.
>>>> But if a Permalink disappears, then it is a big problem.
>>>>
>>>> Although the mod_mbox solution is not perfect, its Permalinks have the
>>>> advantage that the Message-ID and an idea of the List-name can be got
>>>> from the link.
>>>> This means that there is good chance of being able to find the
>>>> matching message(s) in almost any mail archive, should the originals
>>>> be unavailable.
>>>>
>>>> With the current Pony Mail links, if a message is lost from the
>>>> database, AFAICT the only way to recover it is to re-import all the
>>>> messages and hope the same ids are re-generated.
>>>> Since there have been at least two different generator algorithms
>>>> used, and the generator was/is sensitive to the host time zone, this
>>>> will likely be a lengthy process.
>>>> The long links do at least include the list-id, which would reduce the
>>>> work somewhat.
>>>>
>>>> So I think the Permalink needs to be similar to the current mod_mbox 
>>>> solution.
>>>> The MID does not need to be directly related, so long as it is unique
>>>> and stable.
>>>>
>>>>>>
>>>>>> How can list renaming be managed in Pony Mail?
>>>>>> Or is it not allowed?
>>>>>>
>>>>>>
>>>>>> On 14 October 2016 at 00:21, sebb <[email protected]> wrote:
>>>>>>> On 13 October 2016 at 23:14, Daniel Gruno <[email protected]> wrote:
>>>>>>>> On 10/14/2016 12:12 AM, sebb wrote:
>>>>>>>>> On 13 October 2016 at 21:28, sebb <[email protected]> wrote:
>>>>>>>>>> On 7 October 2016 at 00:44, sebb <[email protected]> wrote:
>>>>>>>>>>> The id generator is used to create a key for the message database, 
>>>>>>>>>>> and
>>>>>>>>>>> also to create a Permalink.
>>>>>>>>>>>
>>>>>>>>>>> Therefore, an id generator needs to fulfil the following design 
>>>>>>>>>>> goals
>>>>>>>>>>> as a minimum:
>>>>>>>>>>> A) different messages have different IDs
>>>>>>>>>>> B) the same id is generated if the same message is re-processed
>>>>>>>>>>> C) equivalent messages have the same ID
>>>>>>>>>>>
>>>>>>>>>>> Goal A is needed to ensure that the database can contain every 
>>>>>>>>>>> different message
>>>>>>>>>>> Goal B is needed to ensure that the database can be reloaded from 
>>>>>>>>>>> the
>>>>>>>>>>> original source if necessary
>>>>>>>>>>> Goal C is needed to ensure that the database can be reloaded from an
>>>>>>>>>>> equivalent source, and to ensure that Permalinks are stable.
>>>>>>>>>>>
>>>>>>>>>>> None of the current id generator algorithms meet all of the above 
>>>>>>>>>>> goals.
>>>>>>>>>>>
>>>>>>>>>>> The original and medium generators fail to meet goal A.
>>>>>>>>>>> The full generator fails to meet goal B (and therefore C).
>>>>>>>>>>>
>>>>>>>>>>> A sender can easily generate two messages with identical content; it
>>>>>>>>>>> is important to distinguish these.
>>>>>>>>>>>
>>>>>>>>>>> The Message-Id should help here.
>>>>>>>>>>>
>>>>>>>>>>> Message-ID is supposed to be unique, but in practice it may not be, so
>>>>>>>>>>> some additional fields need to be used to create the database id.
>>>>>>>>>>>
>>>>>>>>>>> For mailing lists the Return-Path will normally contain a unique id
>>>>>>>>>>> which is used to identify bounces.
>>>>>>>>>>> In theory this might be sufficient on its own. Indeed the path might
>>>>>>>>>>> be usable without hashing. However the early ASF mailing list 
>>>>>>>>>>> software
>>>>>>>>>>> did not use unique Return-Paths.
>>>>>>>>>>>
>>>>>>>>>>> The existing mod_mbox solution uses a combination of Message-Id plus
>>>>>>>>>>> YYYYMM plus a list identifier. Have there ever been any collisions?
>>>>>>>>>>
>>>>>>>>>> It's certainly possible for mod_mbox to contain multiple messages 
>>>>>>>>>> with
>>>>>>>>>> the same id.
>>>>>>>>>>
>>>>>>>>>> For example:
>>>>>>>>>>
>>>>>>>>>> From dev-return-15575-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:08:04 2016
>>>>>>>>>> and
>>>>>>>>>> From dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun  7 20:24:24 2016
>>>>>>>>>>
>>>>>>>>>> both have the same id:
>>>>>>>>>>
>>>>>>>>>> Message-Id: <[email protected]>
>>>>>>>>>>
>>>>>>>>>> It's exactly the same message which arrived twice in the same mailing
>>>>>>>>>> list a few seconds apart.
>>>>>>>>>> Maybe it was sent using Bcc as well as To: ?
>>>>>>>>>>
>>>>>>>>>> It's not exactly a collision, but at present mod_mbox is able to 
>>>>>>>>>> store
>>>>>>>>>> both whereas Pony Mail cannot (except if using the full generator,
>>>>>>>>>> which has other problems)
>>>>>>>>>>
>>>>>>>>>> I think it's important to store the full message history in the 
>>>>>>>>>> database.
>>>>>>>>>> For example, if one of the messages bounces, it would be odd if the
>>>>>>>>>> source of the bounce were not in the database.
>>>>>>>>>> Also the message sequences will be incomplete.
>>>>>>>>>> This is the case for lists.a.o, the mbox
>>>>>>>>>>
>>>>>>>>>> https://lists.apache.org/api/[email protected]&date=2016-6
>>>>>>>>>>
>>>>>>>>>> does not have the message sequence number 15575
>>>>>>>>>>
>>>>>>>>>>> How about using the following:
>>>>>>>>>>>
>>>>>>>>>>> Message-Id
>>>>>>>>>>> Date
>>>>>>>>>>> Return-Path
>>>>>>>>>>
>>>>>>>>>> In this case the return-path can be used to distinguish the messages.
>>>>>>>>>
>>>>>>>>> Note that the path depends on where the message is stored:
>>>>>>>>>
>>>>>>>>> Return-Path: <dev-return-15577-archive-asf-public=cust-asf.ponee...@airavata.apache.org>
>>>>>>>>> as against
>>>>>>>>> Return-Path: <dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org>
>>>>>>>>>
>>>>>>>>> So it will need adjusting to extract the parts that are the same for
>>>>>>>>> all message sources, whether that is mod_mbox or Pony Mail or sent to
>>>>>>>>> another subscriber.
>>>>>>>>
>>>>>>>>
>>>>>>>> How about we:
>>>>>>>> 1) select a larger list of headers that we know to be the same across
>>>>>>>> the same email sent to different MTAs, and combine them for the ID
>>>>>>>> generation.
>>>>>>>
>>>>>>> Sort of.
>>>>>>>
>>>>>>> AFAICT the *only* value that can distinguish multiple postings (of
>>>>>>> which there seem to be quite a lot) is the mailing list sequence
>>>>>>> number, which in ezmlm is only available in the Return-Path.
>>>>>>>
>>>>>>> However the sequence could perhaps be reset - so it seems risky to
>>>>>>> rely on that plus the list id alone.
>>>>>>> Also the earliest messages did not have sequence numbers.
>>>>>>>
>>>>>>> So there needs to be another way to identify distinct messages.
>>>>>>> In theory, that is the Message-Id.
>>>>>>> Even if it is not completely unique, it should be OK in combination
>>>>>>> with the sequence number.
>>>>>>>
>>>>>>> If the Message-Id does not exist or is very poor, then I think the
>>>>>>> only solution is to look at most - if not all - the parts of a message
>>>>>>> that a user can vary.
>>>>>>> This is what the full id generation does, except that it also includes
>>>>>>> the MTA-specific headers, causing problems with repeatability.
>>>>>>>
>>>>>>> There is another aspect to this. At present the generation is done
>>>>>>> using the parsed mail, rather than the original.
>>>>>>> If there is any chance that the format can change between releases of
>>>>>>> the library, then this may destroy repeatability.
>>>>>>> Likewise if a different library is ever used.
>>>>>>> We may need to additionally ensure that all headers are in a canonical
>>>>>>> form before use, in case MTAs decide to vary the layout.
>>>>>>>
>>>>>>>> 2) Store some more headers in the doc :)
>>>>>>>
>>>>>>> Those might be useful for searching, but each posting needs its own
>>>>>>> unique id otherwise it cannot be stored in the first place.
>>>>>>>
>>>>>>>>>
>>>>>>>>> The format will presumably depend on the mailing list software that 
>>>>>>>>> is used.
>>>>>>>>>
>>>>>>>>>>> List-Id
>>>>>>>>>>
>>>>>>>>>> I think this must be the original List-Id, not any override.
>>>>>>>>>> Otherwise there may be problems with permalinks if a list name is
>>>>>>>>>> updated - the old permalink will no longer work.
>>>>>>>>>>
>>>>>>>>>>> Whatever new algorithm is chosen, I think it's important that the
>>>>>>>>>>> format looks different from the existing ones. e.g. one could drop 
>>>>>>>>>>> the
>>>>>>>>>>> <> around the list id.
>>>>>>>>
>>>>>
