Re: [Mailman-Developers] Improving the archives
Notice that of 325146 total messages, 624 of them had no message-id header. Even if you aggregate dup+col, you're still looking at a total duplicate rate of 0.29%.

Message-IDs are supposed to be unique. This is discussed in RFC 822: 4.6.1 and RFC 1036: 2.1.5, and probably other places. If that's not the case, the mail transfer agent is broken. I think it's better to go ahead and use the message-id, rather than concoct yet another "this time we mean it!" unique identifier. This is a cost/benefit thing; the cost is some real world collisions, the benefit is a conceptually simpler system. Conceptually simpler things are good, especially when implemented all over the place.

Which brings me to suggestion #2, which is to go ahead and write an RFC on how list servers should embed archival links in messages. This sounds like an internet wide interoperability issue as much as something mailman specific. Why not come up with a scheme usable by all list servers? And also describe a specification third party archival services can comply with. Besides, I've always wanted to help write an RFC. If we go that route, it would be good to get input from a range of people - one person I'd suggest is Earl Hood, author of mhonarc.

Thoughts?

Jeff

While I'm almost tempted to ignore a hit rate that low, if you think of an archive holding 1B messages, you still get a lot of duplicates. OTOH, the rate goes down even lower if you consider the message-id and date headers. (Note, I did not consider messages missing a date header). How likely is it that two messages with the same message-id and date are /not/ duplicates? Heck, at that point, I'd feel justified in simply automatically rejecting the duplicate and chucking it from the archive.

I spent a /little/ time looking at the physical messages that ended up as true collisions. Though by no means did I look at them all, they all looked related.
For example, with strategy 2 some messages look like they'd been inadvertently sent before they were completed. I need to see if there are any similarities in the MUAs behind these, but again, I think we might be able to safely assume that collisions on message-id+date can be ignored.

That leads me to the following proposal, which is just an elaboration on Stephen's. First, all messages live in the same namespace; they are not divided by target mailing list. Each message has two addresses: one is the Message-ID and one is the base32 of the sha1 hash of the Message-ID + Date. As Stephen proposes, Mailman would add these headers if an incoming message is missing them, and tough luck for the non-list copy. The nice thing is that RFC 2822 requires the Date header and states that Message-ID SHOULD be present.

Why the second address? First, it provides as close to a guaranteed unique identifier as we can expect, and second because it produces a nearly human readable format. For example, Stephen's OP would have a second address of

    >>> import base64, hashlib
    >>> mid = '[EMAIL PROTECTED]'
    >>> date = 'Wed, 04 Jul 2007 16:49:58 +0900'
    >>> # XXX perhaps strip off angle brackets
    >>> h = hashlib.sha1(mid)
    >>> h.update(date)
    >>> base64.b32encode(h.digest())
    'RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI'

I like base32 instead of base64 because the more limited alphabet should produce less ambiguous strings in certain fonts, and I don't think the b64 strings are short enough to justify the punctuation characters that would result. While RFC 3548 specifies the b32 alphabet as using uppercase characters, I think any service that accepts b32 ids should be case insensitive. A really Postel-y service could even accept '1' for 'I' and '0' for 'O' just to make it more resilient to human communication errors.

I'd like to come up with a good name for this second address, which would suggest the name of the X- header we stash this value in. X-B32-Message-ID isn't very sexy.
Maybe X-Message-Global-ID, since I think there's a reasonable argument to make that for well-behaved messages, that's exactly what this is.

So now, think of the interface to a message store that supports this addressing scheme. Well it's something like:

    class MessageStore(Interface):
        def store_message(message):
            """Store the message.

            :raises ValueError: when the message is missing either the
                Message-ID header or a Date header.
            :raises DuplicateMessageError: when a message in the store
                already has a matching Message-ID and Date. An archive is
                free to raise this exception for duplicate Message-IDs alone.
            """

        def get_message_by_global_id(key):
            """Locate and return the message from the store that matches `key`.

            :param key: The Global ID of the message to locate. This is the
                base32 encoded SHA1 hash of the message's Message-ID and
                Date headers.
            :returns: The message object matching the Global ID, or None if
                there is no such match.
            """
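A minimal sketch of the addressing scheme and store interface described above, written for Python 3 (so header values are encoded to bytes before hashing, which the 2007 session did not need). The names `global_id`, `normalize_global_id` and `InMemoryMessageStore`, the UTF-8 encoding, and the dict-backed storage are assumptions made for illustration, not Mailman code. Angle brackets are left on the Message-ID, which the post flags as an open question.

```python
import base64
import hashlib
from email.message import Message


class DuplicateMessageError(Exception):
    """A message with the same Message-ID and Date is already stored."""


def global_id(message_id: str, date: str) -> str:
    """Return base32(sha1(Message-ID + Date)) as a 32-character string."""
    h = hashlib.sha1(message_id.encode("utf-8"))  # encoding is an assumption
    h.update(date.encode("utf-8"))
    return base64.b32encode(h.digest()).decode("ascii")


def normalize_global_id(key: str) -> str:
    """Be liberal in what we accept: any case, '1' for 'I', '0' for 'O'.

    The RFC 3548 base32 alphabet (A-Z, 2-7) contains neither '0' nor '1',
    so this mapping cannot turn one valid id into another.
    """
    return key.strip().upper().replace("1", "I").replace("0", "O")


class InMemoryMessageStore:
    """Hypothetical dict-backed store following the interface in the post."""

    def __init__(self):
        self._by_global_id = {}

    def store_message(self, message: Message) -> str:
        mid, date = message["Message-ID"], message["Date"]
        if mid is None or date is None:
            raise ValueError("message needs both Message-ID and Date headers")
        key = global_id(mid, date)
        if key in self._by_global_id:
            raise DuplicateMessageError(key)
        self._by_global_id[key] = message
        return key

    def get_message_by_global_id(self, key: str):
        # Returns None when there is no match, as the interface specifies.
        return self._by_global_id.get(normalize_global_id(key))
```

A SHA-1 digest is 20 bytes, which base32 encodes to exactly 32 characters with no padding, so the resulting id needs no trimming before it goes into a header or URL.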
Re: [Mailman-Developers] Improving the archives
Jeff Breidenbach writes:

> Notice that of 325146 total messages, 624 of them had no message-id header. Even if you aggregate dup+col, you're still looking at a total duplicate rate of 0.29%. Message-IDs are supposed to be unique.

Fortunately, a rule more honored in the observance than the breach. Nonetheless, it *is* breached. The Postel Principle applies here, IMO.

> better to go ahead and use the message-id, rather than concoct yet another "this time we mean it!" unique identifier.

That's not the point. We're not going to impose this on senders; that's what Message-ID is for, as you say. If a sender won't provide a proper Message-ID, third parties who get a CC are just out of luck.

I simply think we should be prepared for applications where relying on the sender to supply a UUID is not acceptable; we need to be able to provide one ourselves. Creating UUIDs is a solved problem, after all. So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL.

Then we say that an archive SHOULD provide access to the resource via Message-ID if available, and define how to construct that URL from the List-Archive and Message-ID headers.

> Which brings me to suggestion #2, which is go ahead and write an RFC on how list servers should embed archival links in messages.

I think Barry already suggested that? Anyway, +1. But remember, a standards-track RFC should have a working implementation to point to.

___
Mailman-Developers mailing list
Mailman-Developers@python.org
http://mail.python.org/mailman/listinfo/mailman-developers
Mailman FAQ: http://www.python.org/cgi-bin/faqw-mm.py
Searchable Archives: http://www.mail-archive.com/mailman-developers%40python.org/
Unsubscribe: http://mail.python.org/mailman/options/mailman-developers/archive%40jab.org
Security Policy: http://www.python.org/cgi-bin/faqw-mm.py?req=showfile=faq01.027.htp
Re: [Mailman-Developers] Improving the archives
"st" == Stephen J Turnbull
"Re: [Mailman-Developers] Improving the archives"
Tue, 24 Jul 2007 15:56:35 +0900

st> Jeff Breidenbach writes:

>> Notice that of 325146 total messages, 624 of them had no message-id header. Even if you aggregate dup+col, you're still looking at a total duplicate rate of 0.29%. Message-IDs are supposed to be unique.

st> Fortunately, a rule more honored in the observance than the breach. Nonetheless, it *is* breached. The Postel Principle applies here, IMO.

Taking "be conservative in what you do" as being at least as important as "be liberal in what you accept from others", the devil can quote this scripture to support simplicity in this instance, IMHO.

>> better to go ahead and use the message-id, rather than concoct yet another "this time we mean it!" unique identifier.

st> That's not the point. We're not going to impose this on senders;

I read the quote as meaning "this time we mean it really is unique", imposing nothing on senders.

st> that's what Message-ID is for, as you say. If a sender won't provide a proper Message-ID, third parties who get a CC are just out of luck.

Right. Maybe that will encourage compliance. The complexity of catering to brokenness in this instance may be too high a price to impose on all.

jam
Re: [Mailman-Developers] Improving the archives
John A. Martin writes:

>> better to go ahead and use the message-id, rather than concoct yet another "this time we mean it!" unique identifier.

> st> That's not the point. We're not going to impose this on senders;

> I read the quote as meaning "this time we mean it really is unique", imposing nothing on senders.

Ah. If so, my reply is "if you want something done right, do it yourself." *All robust databases assign a unique ID to each record.* Why shouldn't a mailing list archive do so?

> Right. Maybe that will encourage compliance. The complexity of catering to brokenness in this instance may be too high a price to impose on all.

What complexity? Mailman just does

    msg['X-List-Archive-Received-ID'] = Email.msgid()

(or however the message ID generator is spelled). After that, it's up to the archiver whether to do anything with it or not. I proposed a way that it could be used; if that's considered too complex, fine. But simply assigning one is not complex or otherwise very costly.
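For readers wondering how the ID generator might be spelled today: a hedged sketch using the standard library's `email.utils.make_msgid`. The header name comes from the post above; the function name `stamp_archive_id` and the default domain `lists.example.org` are made up for illustration, and this is not how any Mailman release actually does it.

```python
from email.message import Message
from email.utils import make_msgid

ARCHIVE_ID_HEADER = "X-List-Archive-Received-ID"  # header name from the post


def stamp_archive_id(msg: Message, domain: str = "lists.example.org") -> str:
    """Attach a list-server-generated unique ID, unless one is already there.

    `domain` is a placeholder; a real list server would use its own host name.
    """
    existing = msg[ARCHIVE_ID_HEADER]
    if existing is not None:
        return existing
    new_id = make_msgid(domain=domain)  # e.g. '<...@lists.example.org>'
    msg[ARCHIVE_ID_HEADER] = new_id
    return new_id
```

Calling it a second time on the same message leaves the header alone, so a message that passes through the list server twice keeps its first ID.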
Re: [Mailman-Developers] Improving the archives
There are three different parties coming to the table. One is the mail transfer agent of the sender, another is the list server, and the third is the archive server. Ideally, all three will be happy campers.

> So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL.

It is the archive server's job to decide what is the canonical URL for a message. There's a good chance these archival URLs will be served by an HTTP redirect. So let's not use the word canonical. :)

> What complexity? Mailman just does
>
>     msg['X-List-Archive-Received-ID'] = Email.msgid()

Easy to introduce, harder to deal with. The archival server would now keep track of both the message-id and the x-list-archive-received-id. That's two namespaces that almost do the same thing. It's easier for the archive server to keep track of one namespace than two, and - most importantly - conceptually simpler.

From the perspective of the assorted list servers, it's easier to do nothing than to do something. So if they can get by with just message-id (which is already implemented) and not have to add x-list-archive-received-id, that's a smoother implementation path.

If we base on message-id, archival servers will be able to retroactively add support for all their stored messages, even those that are ten years old. And users holding an old message will be able to figure out that URL without doing any computational gymnastics.

Put another way, there's the possibility to reduce the archive servers' implementation to "search for this message-id", which is something really useful to have anyway, and therefore likely to get wider support.

In addition, Barry was talking about concocting a unique identifier from the Date field and Message-ID. I'm not a big fan of this idea, because the date field comes from the mail user agent and is often wildly corrupt; e.g. coming from 100 years in the future. Very painful if the archive is showing most recent message first.
Therefore an archival server is very likely to determine message date from the most recent received header (generally from a trusted mail transfer agent) rather than the date field. From the archive server's perspective, the best thing to do with the date field is throw it away.

So for these reasons, I'd rather stick with message-id and risk some real world collisions, instead of introducing another identifier. If the list server receives a message with no message-id, by all means create one on the spot. To me, this feels like the sweet spot in terms of cost benefit. The main thing that bugs me is message-ids are long, which makes them awkward to embed in a URL in the footer of a message.

Jeff
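To illustrate the point about trusting the Received header over Date, here is a rough sketch of how an archiver might pull a timestamp from the most recent Received header. It assumes that relays prepend their Received lines (so the first one in the message is the newest) and that the timestamp follows the last semicolon; the function name `received_date` is hypothetical.

```python
from datetime import datetime
from email.message import Message
from email.utils import parsedate_to_datetime
from typing import Optional


def received_date(msg: Message) -> Optional[datetime]:
    """Return the timestamp of the topmost parseable Received header."""
    for value in msg.get_all("Received", []):
        # The date-time is the part after the final ';' in the header value.
        _, sep, stamp = value.rpartition(";")
        if not sep:
            continue
        try:
            return parsedate_to_datetime(stamp.strip())
        except (TypeError, ValueError):
            # Unparseable stamp: fall through to the next (older) header.
            continue
    return None
```

An archiver sorting "most recent first" could use this value and fall back to the Date header only when it returns None.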
Re: [Mailman-Developers] Improving the archives
Jeff Breidenbach wrote:

> In addition, Barry was talking about concocting a unique identifier from the Date field and Message-ID. I'm not a big fan of this idea, because the date field comes from the mail user agent and is often wildly corrupt; e.g. coming from 100 years in the future.

Oh--I was assuming the Date to which he was referring was the current timestamp at which mailman was processing the message. I was going to say that this guarantees uniqueness, but I guess there are parallel mailman implementations where more than one machine/processor are all serving the same list, and then two different machines/processors might wind up with identical timestamps while processing two different messages.

-Dale
Re: [Mailman-Developers] Improving the archives
On 24-Jul-07, at 12:31 PM, Jeff Breidenbach wrote:

>> So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL.

> It is the archive server's job to decide what is the canonical URL for a message. There's a good chance these archival URLs will be served by an HTTP redirect. So let's not use the word canonical. :)

Someone already pointed out that the message ID is a bit long for a URL, so I'm guessing we're going to want some sort of shorter sequence number for messages for linking purposes.

Regardless of whether we *need* to generate our own unique ID, I'm leaning towards the thought that we're going to *want* to generate our own for usability reasons. In a perfect world, I think we'd have a sequence number so I could visit http://example.com/mailman/archives/listname/204.html and know that 205.html would be the next message to that list, but any short unique id would do if sequence numbers are too much of a pain.

It seems silly to generate nice short links but then use message-id. If we can generate nice short links, we might as well use 'em throughout, unless you really think the default use of the archive will be to search it by message-id (which I sincerely doubt, from my user experiences).

Terri
Re: [Mailman-Developers] Improving the archives
> Regardless of whether we *need* to generate our own unique ID, I'm leaning towards the thought that we're going to *want* to generate our own for usability reasons. In a perfect world, i think we'd have a sequence number so I could visit http://example.com/mailman/archives/listname/204.html and know that 205.html would be the next message to that list, but any short unique id would do if sequence numbers are too much of a pain.

I agree there are a lot of usability benefits from short URLs, but perhaps this is the job of the archive server, and not the list server. Mharc (an archive server) is a great example here. Mharc's canonical message format is pretty human friendly.

http://ww.mhonarc.org/archive/html/mharc-users/2002-08/msg0.html

Unfortunately, there's no trivial way for the list server to know that human friendly URL when the message is sent out. Fortunately, Mharc also happily handles messages by message-id, which the list server does know about.

http://www.mhonarc.org/archive/cgi-bin/mesg.cgi?a=mharc-users[EMAIL PROTECTED]

Had I been the implementer, I'd probably have made mharc do an HTTP 302 redirect from the longer URL to the shorter URL. But that's beside the point. The point is we have an existing, working, happy archival server, and it would be really nice if list servers (such as mailman) were compatible. And by compatible, I mean offering the capability of embedding an archival URL in the footers of messages.

-Jeff
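As a rough illustration of what "embedding an archival URL" from the Message-ID could involve on the list server side, the sketch below percent-encodes the Message-ID into a lookup URL. The URL layout (`<base>/<list>/<message-id>`) and the host `archive.example.org` are invented for this example; they are not Mharc's or any other archiver's actual scheme.

```python
from urllib.parse import quote


def archive_url_for(base_url: str, list_name: str, message_id: str) -> str:
    """Build a by-Message-ID lookup URL (hypothetical layout)."""
    # Drop surrounding whitespace and angle brackets, then escape everything
    # that is not safe in a URL path segment ('@', '+', '/', and so on).
    mid = message_id.strip().strip("<>")
    return "{}/{}/{}".format(
        base_url.rstrip("/"),
        quote(list_name, safe=""),
        quote(mid, safe=""),
    )


footer_link = archive_url_for(
    "http://archive.example.org/", "mharc-users", "<abc+1@host.example>"
)
```

The list server only needs headers it already has, which is the property being argued for; the archive is then free to answer with a redirect to its own shorter URL.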
Re: [Mailman-Developers] Improving the archives
On Jul 22, 2007, at 12:33 PM, Terri Oda wrote:

> On 20-Jul-07, at 8:39 AM, Barry Warsaw wrote:
>> I've looked at a few lurker archivers and I wasn't blown away by its user interface. That's apparently highly configurable though.
>
> I've been doing a lot of thinking about interface, and I'm coming to the conclusion that something more like a web bulletin board is probably the way to go, given that people use them all the time without much trouble and with a fairly minimal amount of whining. ;)

I like this for several reasons. I've long wanted a bridge between the traditional mailing list and a forum because to me they're related along a spectrum of emotional investment.

What I mean is this. For the subjects and projects I care deeply about, I join the mailing list. I want to be intimately involved in the day-to-day collaboration that being subscribed gives me. I care enough about that that I'm willing to put up with the pain that comes along with mailing lists, such as the overhead for subscribing, deleting topics I don't care about, the occasional spam, the overhead of going on vacation or leaving the list, etc.

But there are even more topics or projects that I have only a fleeting interest in. Say I find a bug in some X program, or wake up and decide to learn how to use setuptools, or find that some recent update broke my Linux server. In all those cases, I might want to start a thread of discussion or ask a question, and be very involved in that thread for a week or two. Then, my interest wanes, or I get my question answered, or other projects pique my interest.

Mailing lists are pretty bad at managing those kinds of fleeting involvement, but forums are quite nice. There's usually fairly low overhead (and probably even less if OpenID and such were in widespread adoption) for joining, and when I lose interest the forum doesn't fill up my inbox.
OTOH, forums seem good for short 'instant' messages, but not so good (IMO) for free ranging, detailed discussions. So there's a spectrum.

> I'm trying to use interfaces to things like comment systems (which are often threaded -- picture the slashdot stuff, maybe?) and popular boards like phpbb (which isn't threaded beyond separate topics) as guides to how people usually deal with conversations on the web.
>
> It'd actually be fairly easy, at that point, to just put a posting interface into the archives (yes, you'd have to be logged in, and yes, this means your password becomes that bit more valuable because someone having it can pose as you to the list... but they could do that by spoofing your email address so I'm not too concerned). But then people who don't like email or just want to pop by and check the list quickly could actually use mailman like a web board, which is something I'm pretty sure would get used (I know my users have asked for it in the past).

Heck, /I'd/ use it, so what more justification do we need? :)

> I've been drafting simple prototype interfaces in my head, trying to keep potential architectures in mind. I'm hoping I'll have time this week to code up some HTML and see how well they actually work when they're not just inside my head. :)

I'd love to see the prototypes once you've committed them to HTML. The one important thing is that the individual postings will need the equivalent of a stable archive URL (i.e. permalink) that could be passed around, added to web pages, etc.
-Barry
Re: [Mailman-Developers] Improving the archives
On Jul 24, 2007, at 2:02 AM, Jeff Breidenbach wrote:

> Which brings me to suggestion #2, which is go ahead and write an RFC on how list servers should embed archival links in messages. This sounds like an internet wide interoperability issue as much as something mailman specific. Why not come up with a scheme usable by all list servers? And also describe a specification third party archival services can comply with. Besides, I've always wanted to help write an RFC. If we go that route, it would be good to get input from a range of people - one person I'd suggest is Earl Hood, author of mhonarc.

I've always thought that an RFC-like spec that describes how a generic mailing list manager would interoperate with a generic archiving service is the way to go. I've written up a somewhat more formal spec of what I've currently implemented in MM3 here:

http://wiki.list.org/display/DEV/Stable+URLs

If this looks good, I'd be happy to approach some of the related communities to try to get buy-in.

-Barry
Re: [Mailman-Developers] Improving the archives
On Jul 24, 2007, at 2:56 AM, Stephen J. Turnbull wrote:

> I simply think we should be prepared for applications where relying on the sender to supply a UUID is not acceptable; we need to be able to provide one ourselves. Creating UUIDs is a solved problem, after all. So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL.
>
> Then we say that an archive SHOULD provide access to the resource via Message-ID if available, and define how to construct that URL from the List-Archive and Message-ID headers.

I think there are two approaches we could argue for. One is for the mailing list manager to craft a UUID out of whole cloth and stick that in a header. Then any downstream archiver would be obliged to use that header value as the canonical address of the message, with an alternative path to the message via the Message-ID (possibly returning a list of matching messages when there are collisions).

The second approach, and the one that I favor, is to use the Message-ID (and the Date) header on the original message as the UUID, properly handling corner cases like duplicate headers or missing headers. This UUID serves as the basis for the address to the message resource just like above.

I like the second approach better because in the case where you start with an off-list copy of the message, you have a decent enough chance of getting to the archived message, or at least to a resource containing a link to the message. The first alternative would require access to the list copy.

Imagine if every archiver supported my proposal: knowing just the Message-ID and Date header, you could get to that message from almost anywhere, just by using the UUID as a relative URL rooted at say http://www.mail-archive.com, http://groups.google.com, http://mail.python.org/pipermail, or whatever. That would be pretty neat.
-Barry
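Both approaches above keep a secondary path from Message-ID to the message, and that path may have to return several candidates when Message-IDs collide. One way an archiver might model that is sketched below; the class `MessageIdIndex` and its methods are hypothetical names, and the unique keys it stores could be either the list-generated UUID or the hash-based id.

```python
from collections import defaultdict


class MessageIdIndex:
    """Map a (possibly non-unique) Message-ID to every matching unique key."""

    def __init__(self):
        self._keys_by_message_id = defaultdict(list)

    def add(self, message_id: str, unique_key: str) -> None:
        keys = self._keys_by_message_id[message_id]
        if unique_key not in keys:  # re-indexing the same message is a no-op
            keys.append(unique_key)

    def lookup(self, message_id: str) -> list:
        """Return all unique keys for this Message-ID (empty if unknown)."""
        return list(self._keys_by_message_id.get(message_id, []))
```

A lookup that yields one key can redirect straight to the message; one that yields several can render the "list of matching messages" page mentioned above.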
Re: [Mailman-Developers] Improving the archives
On Jul 24, 2007, at 12:31 PM, Jeff Breidenbach wrote:

>> What complexity? Mailman just does
>>
>>     msg['X-List-Archive-Received-ID'] = Email.msgid()

> Easy to introduce, harder to deal with. The archival server would now keep track of both the message-id and the x-list-archive-received-id. That's two namespaces that almost do the same thing. It's easier for the archive server to keep track of one name space than two, and - most importantly - conceptually simpler.

True, but an archiver already has to handle collisions on the Message-ID so in a sense, you have to maintain multiple paths to the same message, don't you?

So I like my proposal because it imposes nothing additional on the MUA or MTA, a tiny bit more on the MLM, and some extra work (though I think not much) on the archiving agent. What you gain from my proposal over a pure Message-ID approach is guaranteed uniqueness given the list copy, and human friendlier urls.

> From the perspective of the assorted list servers, it's easier to do nothing than to do something. So if they can get by with just message-id (which is already implemented) not have to add x-list-archive-received-id, that's a smoother implementation path.
>
> If we base on message-id, archival servers will be able to retroactively add support for all their stored messages, even those that are ten years old. And users holding an old message will be able to figure out that URL without doing any computational gymnastics.

All these are still true with my proposal, except with the observation, as Stephen points out, that given a URL based on sender-provided headers, you must be prepared to deal with collisions, so sometimes your resources will return lists. The advantage of adding a bit of MLM-provided information is that given the list copy you can guarantee uniqueness, and given the off-list copy you can get to a resource that contains a link to the message you want.
> Put another way, there's the possibility to reduce the archive servers' implementation to "search for this message-id", which is something really useful to have anyway, and therefore likely to get wider support.
>
> In addition, Barry was talking about concocting a unique identifier from the Date field and Message-ID. I'm not a big fan of this idea, because the date field comes from the mail user agent and is often wildly corrupt; e.g. coming from 100 years in the future. Very painful if the archive is showing most recent message first. Therefore an archival server is very likely to determine message date from the most recent received header (generally from a trusted mail transfer agent) rather than the date field. From the archive server's perspective, the best thing to do with the date field is throw it away.

Throw it away or hide it? The former would be a problem, but not the latter. Does your archiver keep a canonical copy of the message as you received it? If so, then you preserve the original Date header enough for the calculation to occur, even if you hide the Date header, or display a Received header date when you render it to HTML. That doesn't matter of course.

But I should point out that I'm not married to including the Date header in the hash. I like it because it appears to reduce collisions, which I care about. But I still like using the base32 sha1 hash instead of the raw Message-ID because I think it's easier for humans to use, read, speak, and copy. Of course this doesn't mean that you need to disable your search-by-Message-ID feature!

> So for these reasons, I'd rather stick with message-id and risk some real world collisions, instead of introduce another identifier. If the list server receives a message with no message-id, by all means create one on the spot. To me, this feels like the sweet spot in terms of cost benefit. The main thing that bugs me is message-ids are long, which makes them awkward to embed in a URL in the footer of a message.
Another advantage for the URL scheme I propose. You know you're going to end up with URLs of

    len(host-prefix) + 32 + 1 + #digits-in-seqno

    (32 == len(base32(sha1digest(data))))
    (1 == / divider)
    (#digits-in-seqno == e.g. len(str(seqno)))

You should be able to keep things in the 60-70 character range, including the host name. That doesn't seem too bad.

-Barry
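The length estimate above can be checked with a few lines of Python. This is only a sketch: it assumes the host prefix already ends in a slash and that the URL is laid out as `<host-prefix><hash>/<seqno>`, which is one reading of the formula rather than a confirmed MM3 layout; the function names are hypothetical.

```python
HASH_LENGTH = 32  # base32 of a 20-byte SHA-1 digest: 160 bits / 5 bits per char


def stable_url(host_prefix: str, global_id: str, seqno: int) -> str:
    """Assemble <host-prefix><hash>/<seqno> (assumed layout)."""
    return f"{host_prefix}{global_id}/{seqno}"


def stable_url_length(host_prefix: str, seqno: int) -> int:
    """len(host-prefix) + 32 + 1 + #digits-in-seqno, as in the post."""
    return len(host_prefix) + HASH_LENGTH + 1 + len(str(seqno))


url = stable_url("http://mail.python.org/", "RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI", 204)
assert len(url) == stable_url_length("http://mail.python.org/", 204)
```

With the 23-character prefix `http://mail.python.org/` and a three-digit sequence number this comes to 59 characters, in line with the 60-70 character range quoted above once a longer prefix is used.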
Re: [Mailman-Developers] Improving the archives
Jeff Breidenbach writes:

>> So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL.

> It is the archive server's job to decide what is the canonical URL for a message. There's a good chance these archival URLs will be served by an HTTP redirect. So let's not use the word canonical. :)

If it's not going to be canonical (I forget if there's a standard for that word :), what is the point in writing an RFC?

>> What complexity? Mailman just does
>>
>>     msg['X-List-Archive-Received-ID'] = Email.msgid()

> Easy to introduce, harder to deal with. The archival server would now keep track of both the message-id and the x-list-archive-received-id. That's two namespaces that almost do the same thing.

The implementations are similar, and there is nearly a one-to-one correspondence. But the semantics are very different. Message-ID is untrustworthy, the internal ID is trustworthy.

> So for these reasons, I'd rather stick with message-id and risk some real world collisions, instead of introduce another identifier.

Go ahead and stick with message-id if *you* like, but please don't tell *me* what risks I have to accept. There needs to be a way to *enforce* uniqueness, and it *must* be specified by the RFC in order for archive implementations to be interoperable.

Note that word "specify"; I do not insist that this level of robustness be *required*. But if we don't specify it now, people who want such robustness will have to do all this work again, and possibly will end up with something that some servers conforming to your RFC will not conform to.

It is possible that most archivers will simply use the message ID, and do something brutal in the rare case of a collision. That's fine. But an archiver that wants to provide a canonical URL which is guaranteed to uniquely and losslessly identify a post in its archive should have a standard way to do that.
> The main thing that bugs me is message-ids are long, which makes them
> awkward to embed in a URL in the footer of a message.

The footer URL is of no concern in this discussion. There is not going to be a requirement that footer URLs be canonical, not if I have any say in the matter. The canonical URL will be in (or be constructed from) the message header.
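To make the distinction between an untrusted Message-ID and an archiver-controlled identifier concrete, here is a minimal sketch of one way an archive could enforce uniqueness itself. It assumes the base32 SHA-1 of Message-ID + Date as the base identifier and a per-identifier sequence number as the tie-breaker; the `Archive` class and its methods are hypothetical and are not taken from Mailman or any existing archiver.

```python
import base64
import hashlib
from collections import defaultdict


class Archive:
    """Toy in-memory archive; a hypothetical illustration only."""

    def __init__(self):
        # base id -> list of distinct message bodies seen under that id
        self._posts = defaultdict(list)

    @staticmethod
    def base_id(message_id: str, date: str) -> str:
        """Base32 of SHA-1 over Message-ID + Date (untrusted sender input)."""
        digest = hashlib.sha1((message_id + date).encode("utf-8")).digest()
        return base64.b32encode(digest).decode("ascii")

    def add(self, message_id: str, date: str, body: str) -> str:
        """Store a post and return a path '<base id>/<seqno>' unique in this archive."""
        key = self.base_id(message_id, date)
        bucket = self._posts[key]
        if body in bucket:
            seqno = bucket.index(body)  # true duplicate: reuse the existing slot
        else:
            bucket.append(body)         # collision with a different post: new slot
            seqno = len(bucket) - 1
        return f"{key}/{seqno}"


archive = Archive()
mid, date = "<a@example.org>", "Wed, 04 Jul 2007 16:49:58 +0900"
first = archive.add(mid, date, "hello")
dup = archive.add(mid, date, "hello")             # same post delivered twice
clash = archive.add(mid, date, "different body")  # same headers, different post
print(first == dup, first == clash)
```

The point of the design is that the sender-supplied headers only select a bucket; the archive alone assigns the sequence number, so two different posts never share a path even if their headers are identical.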
Re: [Mailman-Developers] Improving the archives
> What you gain from my proposal over a pure Message-ID approach is
> guaranteed uniqueness given the list copy

Guarantee is a pretty strong word. A malicious person could post two messages with the same message-id, same date, but different bodies. Sometimes the channel between the MLM and the archive server will be SMTP, and spurious messages can be injected. Finally, from the archive server's perspective, some of the MLMs might make mistakes, just as from the MLM's perspective some of the MTAs might make mistakes in setting message-id. So I don't think the proposed SHA1(date, message-id) scheme buys a hard guarantee of uniqueness. Every component has to protect itself, but none can solve the world's problems.

So that moves us to how much collisions are reduced in practice. I have a question about the numbers Barry mined from the python lists. Are the collisions really that high? One should not count messages without a message-id, because the MLM can and should create one in that case. One should also not count collisions of messages going to different lists. Here's why. Let's say message M is cross posted to lists L1 and L2. Even though it is the same message, there are now two different contexts. (For example, people visiting M at archive L1 should get a completely different experience when they hit "next message" than people visiting M at archive L2.) So I'd be curious what the collision numbers come to with these two factors taken into account. The other takeaway is that the list name really should be part of the URL to get proper context. The earlier example from Mharc does this.

> and human friendlier urls.

That's a very compelling point. SHA1 can't be computed inside someone's head or simply cut-n-pasted together for old messages, but I think the usability benefits of short URLs (short enough that they can comfortably fit inside message bodies) outweigh this drawback. By the way, is SHA-1 still in favor?
My impression was it was fading away after the Shandong University team partially cracked it.

> Throw it away or hide [Date]? The former would be a problem, but not
> the latter.

Thrown away. My favorite archival service is based on mhonarc, and raw mail goes into offline cold storage. Of course this can be changed for future messages with some pain, but there's no reasonable way for me (or any other mhonarc users in the same predicament) to retrofit against Date based URLs.

For the record, here's what mhonarc embeds in each HTML page it produces, because these were considered the important headers. In this message sent from Australia, the date shows a timezone of UTC -0700, because it was pulled from the Received header.

    <!-- MHonArc v2.6.15 -->
    <!--X-Subject: [Gossip] Re: green&#45;travel resources {webliographies} -->
    <!--X-From-R13: [nephf Z. Saqvpbgg zraqvpbgNlnubb.pbz -->
    <!--X-Date: Wed, 26 Apr 2006 00:27:27 &#45;0700 -->
    <!--X-Message-Id: [EMAIL PROTECTED] -->
    <!--X-Content-Type: text/plain -->
    <!--X-Reference: [EMAIL PROTECTED] -->
    <!--X-Head-End-->

So my main request is to double check the numbers, and see if using Date really buys as much as one thinks. I'll keep digesting the other aspects of the wiki page.
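The recount Jeff asks for could be done along these lines. This is a sketch under stated assumptions: the input is taken to be an iterable of `(list_name, message_id, date)` tuples already extracted from the archives, the sample data is invented, and `count_collisions` is a hypothetical helper. It counts repeated keys only; it does not distinguish true duplicates from collisions between different messages.

```python
from collections import Counter


def count_collisions(messages, use_date=False):
    """Count messages whose key was already seen within the same list.

    `messages` is an iterable of (list_name, message_id, date) tuples.
    Messages with no Message-ID are skipped, on the assumption that the
    MLM would have generated one. Keys include the list name, so a
    cross-post to two lists is not counted as a collision.
    """
    seen = Counter()
    for list_name, message_id, date in messages:
        if not message_id:
            continue
        if use_date:
            key = (list_name, message_id, date)
        else:
            key = (list_name, message_id)
        seen[key] += 1
    return sum(n - 1 for n in seen.values())


# Invented sample data, for illustration only.
sample = [
    ("L1", "<a@example.org>", "Wed, 04 Jul 2007 16:49:58 +0900"),
    ("L2", "<a@example.org>", "Wed, 04 Jul 2007 16:49:58 +0900"),  # cross-post
    ("L1", "<a@example.org>", "Thu, 05 Jul 2007 09:00:00 +0900"),  # same id, new date
    ("L1", None, "Thu, 05 Jul 2007 09:30:00 +0900"),               # no Message-ID
]

print(count_collisions(sample))                 # 1
print(count_collisions(sample, use_date=True))  # 0
```

Comparing the two totals on real data would show how much adding Date to the key actually reduces collisions once missing Message-IDs and cross-posts are excluded.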