Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Jeff Breidenbach
 Notice that of 325146 total messages, 624 of them had no message-id
 header.  Even if you aggregate dup+col, you're still looking at a
 total duplicate rate of 0.29%.

Message-IDs are supposed to be unique. This is discussed in
RFC 822: 4.6.1 and RFC 1036: 2.1.5, and probably other places.
If that's not the case, the mail transfer agent is broken. I think it's
better to go ahead and use the message-id, rather than concoct
yet another "this time we mean it!" unique identifier. This is a
cost/benefit thing; the cost is some real world collisions, the benefit
is a conceptually simpler system. Conceptually simpler things are
good, especially when implemented all over the place.

Which brings me to suggestion #2, which is go ahead and write
an RFC on how list servers should embed archival links in messages.
This sounds like an internet wide interoperability issue as much as
something mailman specific. Why not come up with a scheme usable
by all list servers? And also describe a specification third party archival
services can comply to. Besides, I've always wanted to help write
an RFC. If we go that route, it would be good to get input from a range
of people - one person I'd suggest is Earl Hood, author of mhonarc.

Thoughts?

Jeff





 While I'm almost tempted to ignore a
 hit rate that low, if you think of an archive holding 1B messages,
 you still get a lot of duplicates.

 OTOH, the rate goes down even lower if you consider the message-id
 and date headers.  (Note, I did not consider messages missing a date
 header).  How likely is it that two messages with the same message-id
 and date are /not/ duplicates?  Heck, at that point, I'd feel
 justified in simply automatically rejecting the duplicate and
 chucking it from the archive.

 I spent a /little/ time looking at the physical messages that ended
 up as true collisions.  Though by no means did I look at them all,
 they all looked related.  For example, with strategy 2 some messages
 look like they'd been inadvertently sent before they were completed.
 I need to see if there's any similarities in MUA behind these, but
 again, I think we might be able to safely assume that collisions on
 message-id+date can be ignored.

 That leads me to the following proposal, which is just an elaboration
 on Stephen's. First, all messages live in the same namespace; they
 are not divided by target mailing list.  Each message has two
 addresses, one is the Message-ID and one is the base32 of the sha1
 hash of the Message-ID + Date.  As Stephen proposes, Mailman would
 add these headers if an incoming message is missing them, and tough
 luck for the non-list copy.  The nice thing is that RFC 2822 requires
 the Date header and states that Message-ID SHOULD be present.

 Why the second address?  First, it provides as close to a guaranteed
 unique identifier as we can expect, and second because it produces a
 nearly human readable format.  For example, Stephen's OP would have a
 second address of

   >>> import base64, hashlib
   >>> mid
   '[EMAIL PROTECTED]'
   >>> date
   'Wed, 04 Jul 2007 16:49:58 +0900'
   >>> # XXX perhaps strip off angle brackets
   >>> h = hashlib.sha1(mid)
   >>> h.update(date)
   >>> base64.b32encode(h.digest())
   'RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI'

 I like base32 instead of base64 because the more limited alphabet
 should produce less ambiguous strings in certain fonts and I don't
 think the short b64 strings are short enough to justify the
 punctuation characters that would result.  While RFC 3548 specifies
 the b32 alphabet as using uppercase characters, I think any service
 that accepts b32 ids should be case insensitive.  A really Postel-y
 service could even accept '1' for 'I' and '0' for 'O' just to make it
 more resilient to human communication errors.
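
For illustration only, here is a minimal Python 3 sketch of the second address described above, together with the lenient lookup the last paragraph suggests. The function names `global_id` and `normalize_global_id` are hypothetical, and encoding the header values as UTF-8 before hashing is an assumption; the thread does not say how the header text should be turned into bytes.

```python
import base64
import hashlib


def global_id(message_id: str, date: str) -> str:
    """Return the base32 of the SHA-1 over Message-ID followed by Date.

    Assumption: header values are encoded as UTF-8 before hashing.
    """
    h = hashlib.sha1(message_id.encode("utf-8"))
    h.update(date.encode("utf-8"))
    # A SHA-1 digest is 20 bytes, which base32 encodes without padding.
    return base64.b32encode(h.digest()).decode("ascii")


def normalize_global_id(text: str) -> str:
    """Lenient form of an id typed by a human.

    The base32 alphabet is A-Z and 2-7, so '1' and '0' never occur in a
    valid id and can be read as 'I' and 'O' without ambiguity.
    """
    return text.strip().upper().replace("1", "I").replace("0", "O")
```

A service could run `normalize_global_id` on whatever the user typed before looking the id up, so that lower-case or misread ids still resolve.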

 I'd like to come up with a good name for this second address, which
 would suggest the name of the X- header we stash this value in.  X-
 B32-Message-ID isn't very sexy.  Maybe X-Message-Global-ID, since I
 think there's a reasonable argument to make that for well-behaved
 messages, that's exactly what this is.

 So now, think of the interface to a message store that supports this
 addressing scheme.  Well it's something like:

 class MessageStore(Interface):
     def store_message(message):
         """Store the message.

         :raises ValueError: when the message is missing either the
             Message-ID header or a Date header.
         :raises DuplicateMessageError: when a message in the store
             already has a matching Message-ID and Date.  An archive is
             free to raise this exception for duplicate Message-IDs alone.
         """

     def get_message_by_global_id(key):
         """Locate and return the message from the store that matches `key`.

         :param key: The Global ID of the message to locate.  This is the
             base32 encoded SHA1 hash of the message's Message-ID and Date
             headers.
         :returns: The message object matching the Global ID, or None if
             there is no such match.
         """
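
As a rough sketch of how the interface above might be satisfied, here is a dictionary-backed store in Python 3. It is not the MM3 implementation: the class name `InMemoryMessageStore`, the UTF-8 encoding of header values, the case-insensitive lookup, and returning the key from `store_message` are all assumptions made for the example.

```python
import base64
import hashlib
from email.message import Message


class DuplicateMessageError(Exception):
    """Raised when a message with the same Message-ID and Date is stored."""


class InMemoryMessageStore:
    """Hypothetical dict-backed store keyed by the Global ID."""

    def __init__(self):
        self._by_global_id = {}

    @staticmethod
    def _global_id(message):
        # base32(sha1(Message-ID + Date)); UTF-8 encoding is an assumption.
        h = hashlib.sha1(message["Message-ID"].encode("utf-8"))
        h.update(message["Date"].encode("utf-8"))
        return base64.b32encode(h.digest()).decode("ascii")

    def store_message(self, message):
        if message["Message-ID"] is None or message["Date"] is None:
            raise ValueError("message needs both Message-ID and Date headers")
        key = self._global_id(message)
        if key in self._by_global_id:
            raise DuplicateMessageError(key)
        self._by_global_id[key] = message
        return key  # returned for convenience; not part of the interface

    def get_message_by_global_id(self, key):
        # Accept lower-case ids, as suggested for b32 ids in general.
        return self._by_global_id.get(key.upper())
```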
  

Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Stephen J. Turnbull
Jeff Breidenbach writes:

   Notice that of 325146 total messages, 624 of them had no message-id
   header.  Even if you aggregate dup+col, you're still looking at a
   total duplicate rate of 0.29%.
  
  Message ID's are supposed to be unique.

Fortunately, a rule more honored in the observance than the breach.
Nonetheless, it *is* breached.  The Postel Principle applies here, IMO.

  better to go ahead and use the message-id, rather than concoct
  yet another "this time we mean it!" unique identifier.

That's not the point.  We're not going to impose this on senders;
that's what Message-ID is for, as you say.  If a sender won't provide
a proper Message-ID, third parties who get a CC are just out of luck.

I simply think we should be prepared for applications where relying on
the sender to supply a UUID is not acceptable; we need to be able to
provide one ourselves.  Creating UUIDs is a solved problem, after all.
So we just specify a header to put it in, and subscribers will be able
to use it, per definition of a canonical URL.

Then we say that an archive SHOULD provide access to the resource via
Message-ID if available, and define how to construct that URL from the
List-Archive and Message-ID headers.
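
One way such a construction could look, as a hedged Python 3 sketch: the `message-id/` path segment and the function name `archive_url_for` are invented for the example, since no URL layout has been agreed in this thread. The only firm points are that List-Archive carries a URL in angle brackets and that the Message-ID has to be percent-encoded to be safe inside a path.

```python
from urllib.parse import quote


def archive_url_for(list_archive: str, message_id: str) -> str:
    """Build a lookup URL from List-Archive and Message-ID header values.

    Assumption: the archive serves messages under <base>/message-id/<id>.
    """
    base = list_archive.strip().strip("<>").rstrip("/")
    mid = message_id.strip().strip("<>")
    # safe="" so that "/" and "@" inside the id are percent-encoded too.
    return f"{base}/message-id/{quote(mid, safe='')}"
```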

  Which brings me to suggestion #2, which is go ahead and write
  an RFC on how list servers should embed archival links in messages.

I think Barry already suggested that?  Anyway, +1.  But remember, a
standards-track RFC should have a working implementation to point to.

___
Mailman-Developers mailing list
Mailman-Developers@python.org
http://mail.python.org/mailman/listinfo/mailman-developers
Mailman FAQ: http://www.python.org/cgi-bin/faqw-mm.py
Searchable Archives: 
http://www.mail-archive.com/mailman-developers%40python.org/
Unsubscribe: 
http://mail.python.org/mailman/options/mailman-developers/archive%40jab.org

Security Policy: 
http://www.python.org/cgi-bin/faqw-mm.py?req=showfile=faq01.027.htp


Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread John A. Martin
 st == Stephen J Turnbull
 Re: [Mailman-Developers] Improving the archives
  Tue, 24 Jul 2007 15:56:35 +0900

st Jeff Breidenbach writes:
  Notice that of 325146 total messages, 624 of them had no
  message-id header.  Even if you aggregate dup+col, you're
  still looking at a total duplicate rate of 0.29%.

 Message ID's are supposed to be unique.

st Fortunately, a rule more honored in the observance than the
st breach.  Nonetheless, it *is* breached.  The Postel Principle
st applies here, IMO.

Taking "be conservative in what you do" as being at least as important
as "be liberal in what you accept from others", the devil can quote
this scripture to support simplicity in this instance, IMHO.

 better to go ahead and use the message-id, rather than concoct
 yet another "this time we mean it!" unique identifier.

st That's not the point.  We're not going to impose this on
st senders;

I read the quote as meaning "this time we mean it really is unique",
imposing nothing on senders.

st that's what Message-ID is for, as you say.  If a sender won't
st provide a proper Message-ID, third parties who get a CC are
st just out of luck.

Right.  Maybe that will encourage compliance.  The complexity of
catering to brokenness in this instance may be too high a price to
impose on all.

jam



Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Stephen J. Turnbull
John A. Martin writes:

   better to go ahead and use the message-id, rather than concoct
   yet another "this time we mean it!" unique identifier.
  
  st That's not the point.  We're not going to impose this on
  st senders;
  
  I read the quote as meaning "this time we mean it really is unique",
  imposing nothing on senders.

Ah.  If so, my reply is "if you want something done right, do it
yourself."  *All robust databases assign a unique ID to each record.*
Why shouldn't a mailing list archive do so?

  Right.  Maybe that will encourage compliance.  The complexity of
  catering to brokenness in this instance may be too high a price to
  impose on all.

What complexity?  Mailman just does

   msg['X-List-Archive-Received-ID'] = Email.msgid()

(or however the message ID generator is spelled).  After that, it's up
to the archiver whether to do anything with it or not.  I proposed a
way that it could be used; if that's considered too complex, fine.
But simply assigning one is not complex or otherwise very costly.
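
A concrete reading of that one-liner, sketched in Python 3 with the standard library's `email.utils.make_msgid` standing in for "however the message ID generator is spelled". The helper name `stamp_received_id` and the domain `lists.example.org` are placeholders for illustration, not anything Mailman defines.

```python
from email.message import Message
from email.utils import make_msgid


def stamp_received_id(msg: Message, domain: str = "lists.example.org") -> str:
    """Add X-List-Archive-Received-ID once and return its value.

    An id that is already present is left alone, so re-processing the
    same list copy does not change its identity.
    """
    if msg["X-List-Archive-Received-ID"] is None:
        msg["X-List-Archive-Received-ID"] = make_msgid(domain=domain)
    return msg["X-List-Archive-Received-ID"]
```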


Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Jeff Breidenbach
There are three different parties coming to the table. One is
the mail transfer agent of the sender, another is the list server,
and the third is the archive server. Ideally, all three will be happy
campers.

So we just specify a header to put it in, and subscribers will be able
to use it, per definition of a canonical URL.

It is the archive server's job to decide what is the canonical URL
for a message. There's a good chance these archival URLs will be
served by an HTTP redirect. So let's not use the word canonical. :)

What complexity?  Mailman just does

  msg['X-List-Archive-Received-ID'] = Email.msgid()

Easy to introduce, harder to deal with. The archival server would now
keep track of both the message-id and the x-list-archive-received-id.
That's two namespaces that almost do the same thing. It's easier
for the archive server to keep track of one name space than two,
and - most importantly - conceptually simpler.

From the perspective of the assorted list servers, it's easier to
do nothing than to do something. So if they can get by with
just message-id (which is already implemented) and not have to add
x-list-archive-received-id, that's a smoother implementation path.
If we base on message-id, archival servers will be able to
retroactively add support for all their stored messages, even those
that are ten years old. And users holding an old message will be
able to figure out that URL without doing any computational
gymnastics.

Put another way, there's the possibility to reduce the archive
servers' implementation to "search for this message-id" which is
something really useful to have anyway, and therefore likely to
get wider support.

In addition, Barry was talking about concocting a unique
identifier from the Date field and Message-ID. I'm not a big fan of
this idea, because the date field comes from the mail user agent
and is often wildly corrupt, e.g. coming from 100 years in the future.
Very painful if the archive is showing most recent message first.
Therefore an archival server is very likely to determine message date
from the most recent received header (generally from a trusted mail
transfer agent) rather than the date field. From the archive server's
perspective, the best thing to do with the date field is throw it away.
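
To make the Received-header approach concrete, here is a small Python 3 sketch. It assumes the usual Received layout, where the timestamp follows the last semicolon, and it takes the first Received header in the message as the most recent hop. The function name `archive_date` is made up for the example, and real headers can be messier than this handles.

```python
from email.message import Message
from email.utils import parsedate_to_datetime


def archive_date(msg: Message):
    """Return the timestamp of the most recent parseable Received header.

    The sender-supplied Date header is deliberately ignored.  Returns
    None when no Received header carries a usable timestamp.
    """
    # MTAs prepend Received headers, so the first one is the newest.
    for received in msg.get_all("Received", []):
        _, sep, stamp = received.rpartition(";")
        if not sep:
            continue
        try:
            return parsedate_to_datetime(stamp.strip())
        except (TypeError, ValueError):
            # Unparseable timestamp; fall back to the next hop.
            continue
    return None
```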

So for these reasons, I'd rather stick with message-id and risk
some real world collisions, instead of introduce another identifier.
If the list server receives a message with no message-id, by all means
create one on the spot.  To me, this feels like the sweet spot in terms
of cost benefit. The main thing that bugs me is message-ids are long,
which makes them awkward to embed in a URL in the footer of a
message.

Jeff


Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Dale Newfield
Jeff Breidenbach wrote:
 In addition, Barry was talking about concocting a unique
 identifier from the Date field and Message-ID. I'm not a big fan of
 this idea, because the date field comes from the mail user agent
 and is often wildly corrupt, e.g. coming from 100 years in the future.

Oh--I was assuming the Date to which he was referring was the current 
timestamp at which mailman was processing the message.  I was going to 
say that this guarantees uniqueness, but I guess there are parallel 
mailman implementations where more than one machine/processor are all 
serving the same list, and then two different machines/processors might 
wind up with identical timestamps while processing two different messages.

-Dale


Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Terri Oda
On 24-Jul-07, at 12:31 PM, Jeff Breidenbach wrote:
 So we just specify a header to put it in, and subscribers will be  
 able
 to use it, per definition of a canonical URL.
 It is the archive server's job to decide what is the canonical URL
 for a message. There's a good chance these archival URLs will be
 served by an HTTP redirect. So let's not use the word canonical. :)

Someone already pointed out that the message ID is a bit long for a  
URL, so I'm guessing we're going to want some sort of shorter  
sequence number for messages for linking purposes.

Regardless of whether we *need* to generate our own unique ID, I'm
leaning towards the thought that we're going to *want* to generate
our own for usability reasons.  In a perfect world, I think we'd have
a sequence number so I could visit
http://example.com/mailman/archives/listname/204.html and know that
205.html would be the next message to that list, but any short unique
id would do if sequence numbers are too much of a pain.
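
A toy Python 3 sketch of per-list sequence numbers, just to show the shape of the idea. The class name `SequenceAllocator` is hypothetical, the URL layout copies the example.com pattern from the paragraph above, and the counters live in one process only; a real archive would need persistent, shared counters.

```python
from collections import defaultdict
from itertools import count


class SequenceAllocator:
    """Hand out 1, 2, 3, ... per list and build the matching URL."""

    def __init__(self, base_url: str):
        self._base = base_url.rstrip("/")
        # One independent counter per list name, each starting at 1.
        self._counters = defaultdict(lambda: count(1))

    def next_url(self, listname: str) -> str:
        seqno = next(self._counters[listname])
        return f"{self._base}/{listname}/{seqno}.html"
```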

It seems silly to generate nice short links but then use message-id.   
If we can generate nice short links, we might as well use 'em  
throughout, unless you really think the default use of the archive  
will be to search it by messageid (which I sincerely doubt, from my  
user experiences).

  Terri



Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Jeff Breidenbach
 Regardless of whether we *need* to generate our own unique ID, I'm
 leaning towards the thought that we're going to *want* to generate
 our own for usability reasons.  In a perfect world, I think we'd have
 a sequence number so I could visit
 http://example.com/mailman/archives/listname/204.html and know that
 205.html would be the next message to that list, but any short unique
 id would do if sequence numbers are too much of a pain.

I agree there's a lot of usability benefits from short URLs, but perhaps
this is the job of the archive server, and not the list server. Mharc (an
archive server) is a great example here. Mharc's canonical message
format is pretty human friendly.

http://ww.mhonarc.org/archive/html/mharc-users/2002-08/msg0.html

Unfortunately, there's no trivial way for the list server to know that human
friendly URL when the message is sent out. Fortunately, Mharc is also
happy to handle messages by message-id, which the list server does know
about.

http://www.mhonarc.org/archive/cgi-bin/mesg.cgi?a=mharc-users[EMAIL PROTECTED]

Had I been the implementer, I'd probably have made mharc do an HTTP 302
redirect from the longer URL to the shorter URL. But that's beside the point.
The point is we have an existing, working, happy archival server, and it would
be really nice if list servers (such as mailman) were compatible. And by
compatible, I mean offering the capability of embedding an archival URL in the
footers of messages.

-Jeff


Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Barry Warsaw

On Jul 22, 2007, at 12:33 PM, Terri Oda wrote:

 On 20-Jul-07, at 8:39 AM, Barry Warsaw wrote:
 I've looked at a few lurker archivers and I wasn't blown away by its
 user interface.  That's apparently highly configurable though.

 I've been doing a lot of thinking about interface, and I'm coming to
 the conclusion that something more like a web bulletin board is
 probably the way to go, given that people use them all the time
 without much trouble and with a fairly minimal amount of whining. ;)

I like this for several reasons.  I've long wanted a bridge between  
the traditional mailing list and a forum because to me they're  
related along a spectrum of emotional investment.

What I mean is this.  For the subjects and projects I care deeply  
about, I join the mailing list.  I want to be intimately involved in  
the day-to-day collaboration that being subscribed gives me.  I care  
enough about that that I'm willing to put up with the pain that comes  
along with mailing lists, such as the overhead for subscribing,  
deleting topics I don't care about, the occasional spam, the overhead  
of going on vacation or leaving the list, etc.

But there are even more topics or projects that I have only a  
fleeting interest in.  Say I find a bug in some X program, or wake up  
and decide to learn how to use setuptools, or find that some recent  
update broke my Linux server.  In all those cases, I might want to  
start a thread of discussion or ask a question, and be very involved  
in that thread for a week or two.  Then, my interest wanes, or I get  
my question answered, or other projects pique my interest.  Mailing  
lists are pretty bad at managing those kinds of fleeting involvement,  
but forums are quite nice.  There's usually fairly low overhead (and  
probably even less if OpenID and such were in widespread adoption)  
for joining, and when I lose interest the forum doesn't fill up my  
inbox.  OTOH, forums seem good for short 'instant' messages, but not  
so good (IMO) for free ranging, detailed discussions.  So there's a  
spectrum.

 I'm trying to use interfaces to things like comment systems (which
 are often threaded -- picture the slashdot stuff, maybe?) and popular
 boards like phpbb (which isn't threaded beyond separate topics) as
 guides to how people usually deal with conversations on the web.

 It'd actually be fairly easy, at that point, to just put a posting
 interface into the archives (yes, you'd have to be logged in, and
 yes, this means your password becomes that bit more valuable because
 someone having it can pose as you to the list... but they could do
 that by spoofing your email address so I'm not too concerned). But
 then people who don't like email or just want to pop by and check the
 list quickly could actually use mailman like a web board, which is
 something I'm pretty sure would get used (I know my users have asked
 for it in the past).

Heck, /I'd/ use it, so what more justification do we need? :)

 I've been drafting simple prototype interfaces in my head, trying to
 keep potential architectures in mind.  I'm hoping I'll have time this
 week to code up some HTML and see how well they actually work when
 they're not just inside my head. :)

I'd love to see the prototypes once you've committed them to HTML.
The one important thing is that the individual postings will need the
equivalent of a stable archive URL (i.e. permalink) that could be
passed around, added to web pages, etc.

-Barry


Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Barry Warsaw

On Jul 24, 2007, at 2:02 AM, Jeff Breidenbach wrote:

 Which brings me to suggestion #2, which is go ahead and write
 an RFC on how list servers should embed archival links in messages.
 This sounds like an internet wide interoperability issue as much as
 something mailman specific. Why not come up with a scheme usable
 by all list servers? And also describe a specification third party  
 archival
 services can comply to. Besides, I've always wanted to help write
 an RFC. If we go that route, it would be good to get input from a  
 range
 of people - one person I'd suggest is Earl Hood, author of mhonarc.

I've always thought that an RFC-like spec that describes how a
generic mailing list manager would interoperate with a generic
archiving service is the way to go.  I've written up a somewhat more
formal spec of what I've currently implemented in MM3 here:

http://wiki.list.org/display/DEV/Stable+URLs

If this looks good, I'd be happy to approach some of the related  
communities to try to get buy-in.

-Barry


Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Barry Warsaw

On Jul 24, 2007, at 2:56 AM, Stephen J. Turnbull wrote:

 I simply think we should be prepared for applications where relying on
 the sender to supply a UUID is not acceptable; we need to be able to
 provide one ourselves.  Creating UUIDs is a solved problem, after all.
 So we just specify a header to put it in, and subscribers will be able
 to use it, per definition of a canonical URL.

 Then we say that an archive SHOULD provide access to the resource via
 Message-ID if available, and define how to construct that URL from the
 List-Archive and Message-ID headers.

I think there's two approaches we could argue for.  One is for the  
mailing list manager to craft a UUID out of whole cloth and stick  
that in a header.  Then any downstream archiver would be obliged to  
use that header value as the canonical address of the message, with  
an alternative path to the message via the Message-ID (possibly  
returning a list of matching messages when there are collisions).

The second approach, and the one that I favor, is to use the
Message-ID (and the Date) header on the original message as the UUID,
properly handling corner cases like duplicate headers or missing
headers.  This UUID serves as the basis for the address to the
message resource just like above.

I like the second approach better because in the case where you start  
with an off-list copy of the message, you have a decent enough chance  
of getting to the archived message, or at least to a resource  
containing a link to the message.  The first alternative would  
require access to the list copy.

Imagine if every archiver supported my proposal: knowing just the
Message-ID and Date header, you could get to that message from almost
anywhere, just by using the UUID as a relative URL rooted at, say,
http://www.mail-archive.com, http://groups.google.com,
http://mail.python.org/pipermail, or whatever.  That would be pretty neat.

-Barry


Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Barry Warsaw

On Jul 24, 2007, at 12:31 PM, Jeff Breidenbach wrote:

 What complexity?  Mailman just does

  msg['X-List-Archive-Received-ID'] = Email.msgid()

 Easy to introduce, harder to deal with. The archival server would now
 keep track of both the message-id and the x-list-archive-received-id.
 That's two namespaces that almost do the same thing. It's easier
 for the archive server to keep track of one name space than two,
 and - most importantly - conceptually simpler.

True, but an archiver already has to handle collisions on the
Message-ID, so in a sense you have to maintain multiple paths to the
same message, don't you?

So I like my proposal because it imposes nothing additional on the
MUA or MTA, a tiny bit more on the MLM, and some extra work (though I
think not much) on the archiving agent.  What you gain from my
proposal over a pure Message-ID approach is guaranteed uniqueness
given the list copy, and human friendlier URLs.

 From the perspective of the assorted list servers, it's easier to
 do nothing than to do something. So if they can get by with
 just message-id (which is already implemented) and not have to add
 x-list-archive-received-id, that's a smoother implementation path.
 If we base on message-id, archival servers will be able to
 retroactively add support for all their stored messages, even those
 that are ten years old. And users holding an old message will be
 able to figure out that URL without doing any computational
 gymnastics.

All these are still true with my proposal, except with the
observation, as Stephen points out, that given a URL based on
sender-provided headers, you must be prepared to deal with collisions,
so sometimes your resources will return lists.  The advantage of adding
a bit of MLM-provided information is that given the list copy you can
guarantee uniqueness, and given the off-list copy you can get to a
resource that contains a link to the message you want.

 Put another way, there's the possibility to reduce the archive
 servers' implementation to "search for this message-id" which is
 something really useful to have anyway, and therefore likely to
 get wider support.

 In addition, Barry was talking about concocting a unique
 identifier from the Date field and Message-ID. I'm not a big fan of
 this idea, because the date field comes from the mail user agent
 and is often wildly corrupt, e.g. coming from 100 years in the future.
 Very painful if the archive is showing most recent message first.
 Therefore an archival server is very likely to determine message date
 from the most recent received header (generally from a trusted mail
 transfer agent) rather than the date field. From the archive server's
 perspective, the best thing to do with the date field is throw it
 away.

Throw it away or hide it?  The former would be a problem, but not the  
latter.  Does your archiver keep a canonical copy of the message as  
you received it?  If so, then you preserve the original Date header  
enough for the calculation to occur, even if you hide the Date  
header, or display a Received header date when you render it to  
HTML.  That doesn't matter of course.

But I should point out that I'm not married to including the Date  
header in the hash.  I like it because it appears to reduce  
collisions which I care about.  But I still like using the base32  
sha1 hash instead of the raw Message-ID because I think it's easier  
for humans to use, read, speak, and copy.  Of course this doesn't  
mean that you need to disable your search-by-Message-ID feature!

 So for these reasons, I'd rather stick with message-id and risk
 some real world collisions, instead of introduce another identifier.
 If the list server receives a message with no message-id, by all means
 create one on the spot.  To me, this feels like the sweet spot in  
 terms
 of cost benefit. The main thing that bugs me is message-ids are long,
 which makes them awkward to embed in a URL in the footer of a
 message.

Another advantage for the URL scheme I propose.  You know you're
going to end up with URLs of length

    len(host-prefix) + 32 + 1 + #digits-in-seqno

where

    32 == len(base32(sha1digest(data)))
    1 == the "/" divider
    #digits-in-seqno == e.g. len(str(seqno))

You should be able to keep things in the 60-70 character range,  
including the host name.  That doesn't seem too bad.
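
The length arithmetic can be checked with a short sketch. It assumes the hashed data is the Message-ID and Date header values simply concatenated; the thread does not pin down how the two are combined, and the names archive_id and archive_url are hypothetical.

```python
import base64
import hashlib


def archive_id(message_id, date):
    # Assumption: hash the two header values concatenated as UTF-8 text.
    data = (message_id + date).encode("utf-8")
    digest = hashlib.sha1(data).digest()  # SHA-1 digests are 20 bytes
    # 20 bytes = 160 bits = 32 base32 characters, with no '=' padding.
    return base64.b32encode(digest).decode("ascii")


def archive_url(host_prefix, message_id, date, seqno):
    # host prefix + 32-character id + '/' divider + sequence number
    return "%s%s/%d" % (host_prefix, archive_id(message_id, date), seqno)
```

With a host prefix of around 25 characters and a one- or two-digit sequence number, the result lands in the 60-70 character range.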

-Barry

___
Mailman-Developers mailing list
Mailman-Developers@python.org
http://mail.python.org/mailman/listinfo/mailman-developers
Mailman FAQ: http://www.python.org/cgi-bin/faqw-mm.py
Searchable Archives: 

Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Stephen J. Turnbull
Jeff Breidenbach writes:

  So we just specify a header to put it in, and subscribers will be able
  to use it, per definition of a canonical URL.
  
  It is the archive server's job to decide what is the canonical URL
  for a message. There's a good chance these archival URLs will be
  served by an HTTP redirect. So let's not use the word canonical. :)

If it's not going to be canonical (I forget if there's a standard
for that word :), what is the point in writing an RFC?

  What complexity?  Mailman just does
  
msg['X-List-Archive-Received-ID'] = Email.msgid()
  
  Easy to introduce, harder to deal with. The archival server would now
  keep track of both the message-id and the x-list-archive-received-id.
  That's two namespaces that almost do the same thing.

The implementations are similar, and there is nearly a one-to-one
correspondence.  But the semantics are very different.  Message-ID is
untrustworthy, the internal ID is trustworthy.

  So for these reasons, I'd rather stick with message-id and risk
  some real world collisions, instead of introduce another identifier.

Go ahead and stick with message-id if *you* like, but please don't
tell *me* what risks I have to accept.

There needs to be a way to *enforce* uniqueness, and it *must* be
specified by the RFC in order for archive implementations to be
interoperable.  Note that word specify; I do not insist that this
level of robustness be *required*.  But if we don't specify it now,
people who want such robustness will have to do all this work again,
and possibly will end up with something that some servers conforming
to your RFC will not conform to.

It is possible that most archivers will simply use the message ID, and
do something brutal in the rare case of a collision.  That's fine.
But an archiver that wants to provide a canonical URL which is
guaranteed to uniquely and losslessly identify a post in its archive
should have a standard way to do that.
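
One way an archiver could enforce this internally, as a rough sketch and not a proposal for the RFC text: key each post by the Message-ID plus a sequence number that the archive itself assigns, so a colliding Message-ID never overwrites or drops a post. The class name Archive and the in-memory storage are assumptions made only for illustration.

```python
class Archive:
    """Toy in-memory archive keyed by (message_id, seqno)."""

    def __init__(self):
        self._posts = {}  # message_id -> list of raw messages

    def add(self, message_id, raw):
        bucket = self._posts.setdefault(message_id, [])
        if raw in bucket:
            # True duplicate: hand back the key it already has.
            return (message_id, bucket.index(raw))
        # Same Message-ID, different content: keep both posts.
        bucket.append(raw)
        return (message_id, len(bucket) - 1)

    def get(self, key):
        message_id, seqno = key
        return self._posts[message_id][seqno]
```

An archiver that prefers the brutal approach simply never hands out a seqno above 0.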

  The main thing that bugs me is message-ids are long, which makes
  them awkward to embed in a URL in the footer of a message.

The footer URL is of no concern in this discussion.  There is not
going to be a requirement that footer URLs be canonical, not if I
have any say in the matter.  The canonical URL will be in (or be
constructed from) the message header.



Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Jeff Breidenbach
 What you gain from my proposal over a pure Message-ID approach
 is guaranteed uniqueness given the list copy

Guarantee is a pretty strong word. A malicious person could post two
messages with the same message-id, same date, but different bodies.
Sometimes the channel between the MLM and the archive server will
be SMTP, and spurious messages can be injected. Finally, from the archive
server's perspective, some of the MLMs might make mistakes - just like
from the MLM's perspective, some of the MTAs might make mistakes in
setting message-id. So I don't think the proposed SHA1(date, message-id)
scheme buys a hard guarantee of uniqueness. Every component has
to protect itself, but none can solve the world's problems.

So that moves us to how many collisions are reduced in practice.
I have a question about the numbers Barry mined from the python
lists. Are the collisions really that high? One should not count
messages without a message-id, because the MLM can and should
create one in that case.

One should also not count collisions of messages going to different
lists. Here's why. Let's say message M is cross posted to lists L1 and
L2. Even though it is the same message, there are now two different
contexts. (For example, people visiting M at archive L1 should get a
completely different experience when they hit next message than people
visiting M at archive L2.)

So I'd be curious what the collision numbers come to with these two
factors taken into account. The other takeaway is that the list name really
should be part of the URL to get proper context. The earlier example
from Mharc does this.
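
A sketch of the recount being asked for, assuming the corpus can be reduced to (list name, message-id) pairs. The function name count_collisions and the input shape are made up for illustration and are not taken from Barry's script.

```python
from collections import Counter


def count_collisions(messages):
    """Count extra copies of a message-id within a single list.

    messages: iterable of (list_name, message_id) pairs, where
    message_id is None or empty when the header was missing.
    """
    seen = Counter(
        (list_name, message_id)
        for list_name, message_id in messages
        if message_id  # skip missing ids: the MLM would create one
    )
    # Keying on (list, id) means a cross-post to two lists is not a
    # collision; only repeats inside one list are counted.
    return sum(n - 1 for n in seen.values() if n > 1)
```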

 and human friendlier urls.

That's a very compelling point.

SHA1 can't be computed inside someone's head or simply cut-n-pasted
together for old messages, but I think the usability benefits of short
URLs (short enough that they can comfortably fit inside message bodies)
outweigh this drawback. By the way, is SHA-1 still in favor? My
impression was it was fading away after the Shandong University team
partially cracked it.

 Throw it away or hide [Date]?  The former would be a problem,
 but not the latter.

Thrown away. My favorite archival service is based on mhonarc,
and raw mail goes into offline cold storage. Of course this can be
changed for future messages with some pain, but there's no
reasonable way for me (or any other mhonarc users in the
same predicament) to retrofit against Date based URLs. For the
record, here's what mhonarc embeds in each HTML page it
produces because these were considered the important headers.
In this message sent from Australia, the date shows a timezone
of UTC -0700, because it was pulled from the received header.

<!-- MHonArc v2.6.15 -->
<!--X-Subject: [Gossip] Re: green&#45;travel resources {webliographies} -->
<!--X-From-R13: [nephf Z. Saqvpbgg zraqvpbgNlnubb.pbz -->
<!--X-Date: Wed, 26 Apr 2006 00:27:27 &#45;0700 -->
<!--X-Message-Id: [EMAIL PROTECTED] -->
<!--X-Content-Type: text/plain -->
<!--X-Reference: [EMAIL PROTECTED] -->
<!--X-Head-End-->

So my main request is to double check the numbers, see if using
Date really buys as much as one thinks. I'll keep digesting the
other aspects of the wiki page.