Another issue I have noticed: From_ mangling.

Exported mbox files must use From_ mangling to be usable as input.
However, the commonest method (mboxo) only mangles a plain From_ line (by prefixing it with >), so after export one cannot distinguish a line that was originally From_ from one that was originally >From_.
[It looks like the ASF mod_mbox archives use mboxo.]
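To make the ambiguity concrete, here is a minimal Python sketch. It is an illustration only, not the Pony Mail importer code; the function names are made up for this example. "From_" is taken to mean a line starting with "From" followed by a space. The mboxrd variant is not mentioned above and is shown only for contrast, since it also mangles lines that already start with one or more > and is therefore reversible.

```python
import re

# Hypothetical helper names; none of these exist in Pony Mail.

def mboxo_mangle(line: str) -> str:
    """mboxo: prefix '>' only to lines that start with a plain 'From '."""
    return ">" + line if line.startswith("From ") else line

def mboxo_unmangle(line: str) -> str:
    """Naive inverse: strip one '>' from any line starting with '>From '."""
    return line[1:] if line.startswith(">From ") else line

def mboxrd_mangle(line: str) -> str:
    """mboxrd: also prefix '>' to 'From ' lines already preceded by '>'s."""
    return ">" + line if re.match(r">*From ", line) else line

def mboxrd_unmangle(line: str) -> str:
    """Inverse of mboxrd_mangle: strip exactly one leading '>'."""
    return line[1:] if re.match(r">+From ", line) else line

# Two body lines as received live by the archiver.
body = ["From here on it gets tricky", ">From the original sender"]

# mboxo: both lines end up starting with '>From ', so the round trip is lossy.
stored_o = [mboxo_mangle(l) for l in body]
restored_o = [mboxo_unmangle(l) for l in stored_o]

# mboxrd: the second line becomes '>>From ...', so the round trip is exact.
stored_rd = [mboxrd_mangle(l) for l in body]
restored_rd = [mboxrd_unmangle(l) for l in stored_rd]

print(restored_o == body)   # lossy round trip
print(restored_rd == body)  # exact round trip
```

With mboxo the second line comes back without its leading >, so any hash computed over the body differs from the one computed on the live message.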
This means that the live message as received by the archiver may not be the same as the unmangled mbox entry.

The current importer does not unmangle >From_ lines, which means that re-importation of a live message will change the MID (and the Permalink).

Likewise, once the importer is fixed to do the unmangling, re-importation of previously imported messages will change the MID (and the Permalink).

There will still be the issue of >From_ lines in original messages, but that should hopefully only affect a few messages (quoted lines have a space after the >).

On 8 December 2016 at 13:25, sebb <[email protected]> wrote:
> I have just discovered that there are some duplicate ezmlm message
> index numbers in the [email protected] archives.
> This means we cannot rely on the number being unique, though the From
> line is almost certainly unique if the date is included.
> But unless the same data is present in all copies of the message,
> hashes derived from it won't be stable.
>
> 200103.mbox:From [email protected] Fri Mar 09 20:26:49 2001
> 200103.mbox:From [email protected] Mon Mar 12 12:19:04 2001
> 200103.mbox:From [email protected] Mon Mar 12 19:09:39 2001
> 200103.mbox:From [email protected] Wed Mar 28 17:46:40 2001
> 200104.mbox:From [email protected] Mon Apr 09 19:25:39 2001
> 200105.mbox:From [email protected] Tue May 01 21:14:21 2001
> 200105.mbox:From [email protected] Tue May 01 21:59:08 2001
> 200105.mbox:From [email protected] Sat May 12 22:11:46 2001
> 200105.mbox:From [email protected] Mon May 21 22:51:04 2001
> 200105.mbox:From [email protected] Tue May 22 17:12:40 2001
> 200105.mbox:From [email protected] Thu May 31 06:14:01 2001
> 200106.mbox:From [email protected] Fri Jun 01 14:31:52 2001
> 200107.mbox:From [email protected] Tue Jul 17 18:56:30 2001
> 200110.mbox:From [email protected] Sun Oct 14 03:23:08 2001
> 200111.mbox:From [email protected] Fri Nov 16 00:13:49 2001
>
> 200210.mbox:From [email protected] Thu Oct 03 19:20:20 2002
> 200210.mbox:From [email protected] Fri Oct 04 19:01:01 2002
> 200211.mbox:From [email protected] Tue Nov 05 20:29:51 2002
> 200211.mbox:From [email protected] Wed Nov 27 19:39:07 2002
> 200301.mbox:From [email protected] Mon Jan 20 23:34:23 2003
> 200301.mbox:From [email protected] Sun Jan 26 08:21:21 2003
> 200302.mbox:From [email protected] Mon Feb 24 16:35:58 2003
> 200303.mbox:From [email protected] Mon Mar 03 15:44:33 2003
> 200303.mbox:From [email protected] Mon Mar 17 15:55:44 2003
> 200305.mbox:From [email protected] Wed May 28 16:40:34 2003
> 200307.mbox:From [email protected] Wed Jul 09 12:06:06 2003
> 200307.mbox:From [email protected] Fri Jul 18 13:48:09 2003
> 200308.mbox:From [email protected] Tue Aug 05 16:10:28 2003
> 200308.mbox:From [email protected] Wed Aug 13 10:26:40 2003
> 200308.mbox:From [email protected] Fri Aug 15 10:21:03 2003
>
> I don't know what happened to create duplicates partway through the sequence.
> The sequence proper started here:
>
> 200201.mbox:From [email protected] Fri Jan 11 20:30:21 2002
>
> Prior to Jan 2002 AFAICT there are only numbers 32-46 as shown above
> in the 2001 archives.
>
>
> On 16 November 2016 at 21:50, sebb <[email protected]> wrote:
>> Just discovered an issue which affects the Permalinks.
>>
>> If the archiver msgbody() function fails to detect a text message
>> body, it will return null.
>> If the message has an attachment, then the archiver will generate an
>> entry for it.
>> However the short and medium id generators will fail when accessing the body.
>>
>> The mid will then revert to the previously calculated value.
>> This is derived from the list id and the 'archived-at' header.
>> However the archived-at header is added by the archiver itself if necessary.
>> So if the message ever needs to be reloaded from elsewhere, a new
>> mid/Permalink value will likely be generated.
>> The Permalink will stop working unless the original entry is kept.
>>
>> Unfortunately the fallback mids have exactly the same format as the
>> medium generator (sha224@lid).
>> This makes it impossible to determine which Permalinks are affected
>> from the link alone.
>> However any existing mbox entries which have body: null will be using
>> the fallback mid.
>>
>>
>> On 25 October 2016 at 19:50, sebb <[email protected]> wrote:
>>> Unfortunately it appears that Message-Ids are not always generated
>>> (*), so there needs to be an equivalent that can be used.
>>> Furthermore, some emails may have more than one Message-Id [+]
>>>
>>> The original message fields (including full body) are not enough to
>>> uniquely id a message in the mailing list stream.
>>> For that one needs something like a sequence number.
>>> However some early mailboxes don't include the ezmlm sequence number.
>>>
>>> However AFAIK every message that reaches a mailing list must have been
>>> delivered to the mailserver mailbox.
>>> So the mailing list message *instance* should be identifiable using
>>> either the sequence number, or failing that, some or all of the
>>> routing information by which it arrived at the mailbox.
>>>
>>> And one can identify the mail *content* using the Message-Id if
>>> present (failing that, a hash of the full e-mail).
>>>
>>> This needs some more work and some worked example messages.
>>>
>>> Sorry to keep going on about this, but AFAICT currently Pony Mail does
>>> not have an algorithm that satisfies all the requirements for use as
>>> an ES id.
>>>
>>> (*) Automated services and older e-mails don't always have them.
>>> Such mails don't appear in mod_mbox listings though they are present
>>> in the raw mbox files.
>>> [+] mod_mbox appears to use the first Message-Id it finds.
>>>
>>> On 22 October 2016 at 23:02, sebb <[email protected]> wrote:
>>>> On 18 October 2016 at 10:49, Daniel Gruno <[email protected]> wrote:
>>>>> On 10/18/2016 11:43 AM, sebb wrote:
>>>>>> It just occurs to me that there is another aspect to be considered:
>>>>>> list renaming.
>>>>>> This might affect both the unique id and Permalinks.
>>>>>>
>>>>>> If the list id is embedded in the Permalink hash, I think it would be
>>>>>> very difficult to honour existing links.
>>>>>>
>>>>>> This is in contrast with the link format as used by mod_mbox, where
>>>>>> the list id is exposed in the URL, for example:
>>>>>>
>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-any23-dev/201209.mbox/<message-id>
>>>>>> http://mail-archives.apache.org/mod_mbox/any23-dev/201209.mbox/<message-id>
>>>>>>
>>>>>> It would be easy enough to redirect one URL to the other - if necessary.
>>>>>> (In this case, the original lists have been retained)
>>>>>
>>>>> Generally not an issue.
>>>>> when you rename a list in Pony Mail, you keep the original ID and
>>>>> source, you only touch the list value in elasticsearch. permalinks will
>>>>> remain the same, but point to the new list name (and keep the original
>>>>> in the source).
>>>>
>>>> I see, that's probably OK.
>>>>
>>>>> There are of course edge cases where you have to reimport and you force
>>>>> a list ID on the command line, which may change the generated ID.
>>>>
>>>> That could be a problem both for the ES id and the Permalink.
>>>> If the id uses the new LID, it will result in a new MID.
>>>> If the old message is still present, there will be two copies.
>>>> If the old message is deleted, the Permalink will break.
>>>>
>>>> I think we do need to allow for re-imports.
>>>> This makes generation of the MID tricky.
>>>>
>>>>> If we scrap the list id for a moment, we are left with, at least, the
>>>>> following constants:
>>>>>
>>>>> Sender
>>>>> Date
>>>>> Subject
>>>>> Body
>>>>
>>>> (I assume this includes the attachments)
>>>>
>>>>> Message ID
>>>>>
>>>>> The problem with the above is, that an email with the same values in all
>>>>> fields may have been sent to multiple lists, so ideally we'd keep
>>>>> separate copies for each list....which gets tricky without adding the
>>>>> list ID to this. A workaround would be to force the ID generator to
>>>>> always use the original list ID in the source for generating the ID, and
>>>>> only use the forced list ID if no list was found.
>>>>
>>>> The identical message may be sent to the same list twice.
>>>> I've seen this already on some of the lists (I think I mentioned this
>>>> already).
>>>> It looks like a mail client sometimes duplicates the message.
>>>> Maybe it can also occur if the destination list has an alias and the
>>>> message is sent to the alias as well.
>>>>
>>>> So the list id is not sufficient to identify the specific message.
>>>>
>>>> However mailing list software has to identify bounces.
>>>> This is usually done by adding a unique id to the Return-Path.
>>>> In the case of ezmlm this is a sequence number.
>>>>
>>>> That will be present in mails sent to subscribers, and is present in
>>>> the mbox files (apart from some very early ones).
>>>>
>>>> If present, it should be a constant for a particular message on a
>>>> specific mailing list.
>>>>
>>>>> Thoughts?
>>>>
>>>> The MID needs to be unique to ensure that all messages can be stored OK
>>>> in ES.
>>>> Unwanted duplicates cause message loss.
>>>>
>>>> The Permalink needs to be persistent.
>>>> If there are multiple Permalinks for the same message that is not a
>>>> problem.
>>>> If a single Permalink applies to multiple messages, that is not ideal,
>>>> but at least there is a chance of recovering the message.
>>>> But if a Permalink disappears, then it is a big problem.
>>>>
>>>> Although the mod_mbox solution is not perfect, its Permalinks have the
>>>> advantage that the Message-ID and an idea of the List-name can be got
>>>> from the link.
>>>> This means that there is a good chance of being able to find the
>>>> matching message(s) in almost any mail archive, should the originals
>>>> be unavailable.
>>>>
>>>> With the current Pony Mail links, if a message is lost from the
>>>> database, AFAICT the only way to recover it is to re-import all the
>>>> messages and hope the same ids are re-generated.
>>>> Since there have been at least two different generator algorithms
>>>> used, and the generator was/is sensitive to the host time zone, this
>>>> will likely be a lengthy process.
>>>> The long links do at least include the list-id, which would reduce the
>>>> work somewhat.
>>>>
>>>> So I think the Permalink needs to be similar to the current mod_mbox
>>>> solution.
>>>> The MID does not need to be directly related, so long as it is unique
>>>> and stable.
>>>>
>>>>>>
>>>>>> How can list renaming be managed in Pony Mail?
>>>>>> Or is it not allowed?
>>>>>>
>>>>>>
>>>>>> On 14 October 2016 at 00:21, sebb <[email protected]> wrote:
>>>>>>> On 13 October 2016 at 23:14, Daniel Gruno <[email protected]> wrote:
>>>>>>>> On 10/14/2016 12:12 AM, sebb wrote:
>>>>>>>>> On 13 October 2016 at 21:28, sebb <[email protected]> wrote:
>>>>>>>>>> On 7 October 2016 at 00:44, sebb <[email protected]> wrote:
>>>>>>>>>>> The id generator is used to create a key for the message database,
>>>>>>>>>>> and also to create a Permalink.
>>>>>>>>>>>
>>>>>>>>>>> Therefore, an id generator needs to fulfil the following design
>>>>>>>>>>> goals as a minimum:
>>>>>>>>>>> A) different messages have different IDs
>>>>>>>>>>> B) the same id is generated if the same message is re-processed
>>>>>>>>>>> C) equivalent messages have the same ID
>>>>>>>>>>>
>>>>>>>>>>> Goal A is needed to ensure that the database can contain every
>>>>>>>>>>> different message
>>>>>>>>>>> Goal B is needed to ensure that the database can be reloaded from
>>>>>>>>>>> the original source if necessary
>>>>>>>>>>> Goal C is needed to ensure that the database can be reloaded from an
>>>>>>>>>>> equivalent source, and to ensure that Permalinks are stable.
>>>>>>>>>>>
>>>>>>>>>>> None of the current id generator algorithms meet all of the above
>>>>>>>>>>> goals.
>>>>>>>>>>>
>>>>>>>>>>> The original and medium generators fail to meet goal A.
>>>>>>>>>>> The full generator fails to meet goal B (and therefore C).
>>>>>>>>>>>
>>>>>>>>>>> A sender can easily generate two messages with identical content; it
>>>>>>>>>>> is important to distinguish these.
>>>>>>>>>>>
>>>>>>>>>>> The Message-Id should help here.
>>>>>>>>>>>
>>>>>>>>>>> Message-ID is supposed to be unique, in practice it may not be, so
>>>>>>>>>>> some additional fields need to be used to create the database id.
>>>>>>>>>>>
>>>>>>>>>>> For mailing lists the Return-Path will normally contain a unique id
>>>>>>>>>>> which is used to identify bounces.
>>>>>>>>>>> In theory this might be sufficient on its own. Indeed the path might
>>>>>>>>>>> be usable without hashing. However the early ASF mailing list
>>>>>>>>>>> software did not use unique Return-Paths.
>>>>>>>>>>>
>>>>>>>>>>> The existing mod_mbox solution uses a combination of Message-Id plus
>>>>>>>>>>> YYYYMM plus a list identifier. Have there ever been any collisions?
>>>>>>>>>>
>>>>>>>>>> It's certainly possible for mod_mbox to contain multiple messages
>>>>>>>>>> with the same id.
>>>>>>>>>>
>>>>>>>>>> For example:
>>>>>>>>>>
>>>>>>>>>> From dev-return-15575-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun 7 20:08:04 2016
>>>>>>>>>> and
>>>>>>>>>> From dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org Tue Jun 7 20:24:24 2016
>>>>>>>>>>
>>>>>>>>>> both have the same id:
>>>>>>>>>>
>>>>>>>>>> Message-Id: <[email protected]>
>>>>>>>>>>
>>>>>>>>>> It's exactly the same message which arrived twice in the same mailing
>>>>>>>>>> list a few seconds apart.
>>>>>>>>>> Maybe it was sent using Bcc as well as To: ?
>>>>>>>>>>
>>>>>>>>>> It's not exactly a collision, but at present mod_mbox is able to
>>>>>>>>>> store both whereas Pony Mail cannot (except if using the full
>>>>>>>>>> generator, which has other problems)
>>>>>>>>>>
>>>>>>>>>> I think it's important to store the full message history in the
>>>>>>>>>> database.
>>>>>>>>>> For example, if one of the messages bounces, it would be odd if the
>>>>>>>>>> source of the bounce were not in the database.
>>>>>>>>>> Also the message sequences will be incomplete.
>>>>>>>>>> This is the case for lists.a.o, the mbox
>>>>>>>>>>
>>>>>>>>>> https://lists.apache.org/api/[email protected]&date=2016-6
>>>>>>>>>>
>>>>>>>>>> does not have the message sequence number 15575
>>>>>>>>>>
>>>>>>>>>>> How about using the following:
>>>>>>>>>>>
>>>>>>>>>>> Message-Id
>>>>>>>>>>> Date
>>>>>>>>>>> Return-Path
>>>>>>>>>>
>>>>>>>>>> In this case the return-path can be used to distinguish the messages.
>>>>>>>>>
>>>>>>>>> Note that the path depends on where the message is stored:
>>>>>>>>>
>>>>>>>>> Return-Path: <dev-return-15577-archive-asf-public=cust-asf.ponee...@airavata.apache.org>
>>>>>>>>> as against
>>>>>>>>> Return-Path: <dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org>
>>>>>>>>>
>>>>>>>>> So it will need adjusting to extract the parts that are the same for
>>>>>>>>> all message sources, whether that is mod_mbox or Pony Mail or sent to
>>>>>>>>> another subscriber.
>>>>>>>>
>>>>>>>>
>>>>>>>> How about we:
>>>>>>>> 1) select a larger list of headers that we know to be the same across
>>>>>>>> the same email sent to different MTAs, and combine them for the ID
>>>>>>>> generation.
>>>>>>>
>>>>>>> Sort of.
>>>>>>>
>>>>>>> AFAICT the *only* value that can distinguish multiple postings (of
>>>>>>> which there seem to be quite a lot) is the mailing list sequence
>>>>>>> number, which in ezmlm is only available in the Return-Path.
>>>>>>>
>>>>>>> However the sequence could perhaps be reset - so it seems risky to
>>>>>>> rely on that plus the list id alone.
>>>>>>> Also the earliest messages did not have sequence numbers.
>>>>>>>
>>>>>>> So there needs to be another way to identify distinct messages.
>>>>>>> In theory, that is the Message-Id.
>>>>>>> Even if it is not completely unique, it should be OK in combination
>>>>>>> with the sequence number.
>>>>>>>
>>>>>>> If the Message-Id does not exist or is very poor, then I think the
>>>>>>> only solution is to look at most - if not all - the parts of a message
>>>>>>> that a user can vary.
>>>>>>> This is what the full id generation does, except that it also includes
>>>>>>> the MTA-specific headers, causing problems with repeatability.
>>>>>>>
>>>>>>> There is another aspect to this. At present the generation is done
>>>>>>> using the parsed mail, rather than the original.
>>>>>>> If there is any chance that the format can change between releases of
>>>>>>> the library, then this may destroy repeatability.
>>>>>>> Likewise if a different library is ever used.
>>>>>>> We may need to additionally ensure that all headers are in a canonical
>>>>>>> form before use, in case MTAs decide to vary the layout.
>>>>>>>
>>>>>>>> 2) Store some more headers in the doc :)
>>>>>>>
>>>>>>> Those might be useful for searching, but each posting needs its own
>>>>>>> unique id otherwise it cannot be stored in the first place.
>>>>>>>
>>>>>>>>>
>>>>>>>>> The format will presumably depend on the mailing list software that
>>>>>>>>> is used.
>>>>>>>>>
>>>>>>>>>>> List-Id
>>>>>>>>>>
>>>>>>>>>> I think this must be the original List-Id, not any override.
>>>>>>>>>> Otherwise there may be problems with permalinks if a list name is
>>>>>>>>>> updated - the old permalink will no longer work.
>>>>>>>>>>
>>>>>>>>>>> Whatever new algorithm is chosen, I think it's important that the
>>>>>>>>>>> format looks different from the existing ones. e.g. one could drop
>>>>>>>>>>> the <> around the list id.
>>>>>>>>
>>>>>
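
For reference, the Return-Path adjustment discussed in the quoted thread (keeping only the parts that are the same for every copy of a posting) could be sketched roughly as follows. This is an assumption-laden illustration, not existing Pony Mail code: the function name instance_key and the regular expression are made up here, and the pattern only covers the ezmlm-style "<list>-return-<seq>-<recipient>@<domain>" layout seen in the two examples quoted above.

```python
import re
from typing import Optional, Tuple

# Assumed ezmlm-style layout, inferred from the two Return-Path examples in
# the thread; other list software will need a different pattern.
_RETURN_PATH = re.compile(
    r"<?(?P<list>[^@>]+?)-return-(?P<seq>\d+)-[^@>]*@(?P<domain>[^>]+)>?"
)

def instance_key(return_path: str) -> Optional[Tuple[str, int, str]]:
    """Return (list, sequence number, domain), dropping the per-recipient part.

    Returns None when no sequence number can be found, as with the very
    early messages mentioned in the thread.
    """
    m = _RETURN_PATH.fullmatch(return_path.strip())
    if m is None:
        return None
    return (m.group("list"), int(m.group("seq")), m.group("domain"))

# The two copies of posting 15577 quoted in the thread.
a = instance_key(
    "<dev-return-15577-archive-asf-public=cust-asf.ponee...@airavata.apache.org>"
)
b = instance_key(
    "<dev-return-15577-apmail-airavata-dev-archive=airavata.apache....@airavata.apache.org>"
)
print(a == b)
```

As noted in the thread, the sequence number might be reset, so a key like this would probably still need to be combined with the Message-Id (or a content hash) rather than used alone.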
