Re: [Mailman-Developers] Improving the archives

2007-07-26 Thread Jeff Breidenbach
> If you are relying on the sender to do the right thing, then
> why not force them to create proper message-ids?

I think Barry's proposal is essentially a numbers game - i.e.
he's hoping for significantly better results using "Date" in
the calculation than not using it.

http://wiki.list.org/display/DEV/Stable+URLs

I'll try to tease out some more useful stats from some large
datasets this weekend. (I can't just run the python scripts as is
because I don't have python 2.5 in the same place as the data,
I don't keep raw messages in mbox format, blah blah blah, but
we'll figure it out).

My hypothesis is "Date" doesn't really buy much, but that's
in part because I have a vested interest in that outcome.
We'll see how the data plays out. And I still think RFC 2369
headers are needed in the calculation if cross-posted
messages are to be handled correctly.
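
To make the cross-posting point concrete, here is a minimal sketch of an
identifier that hashes Message-ID, optionally Date, and optionally one
RFC 2369 header (List-Post). The function name stable_id, the choice of
SHA-1 with base32 output, and the way the fields are joined are all
assumptions made for illustration; this is not necessarily the scheme
described on the wiki page, and it is written for a current Python 3
rather than python 2.5.

```python
import base64
import hashlib
from email import message_from_string


def stable_id(msg, use_date=True, use_list_post=True):
    """Hash selected headers into a short identifier (hypothetical scheme)."""
    parts = [msg.get("Message-ID", "").strip()]
    if use_date:
        parts.append(msg.get("Date", "").strip())
    if use_list_post:
        # List-Post is one of the RFC 2369 headers; it differs per list,
        # so two copies of a cross-posted message hash differently.
        parts.append(msg.get("List-Post", "").strip())
    digest = hashlib.sha1("\n".join(parts).encode("utf-8")).digest()
    return base64.b32encode(digest).decode("ascii")


# Two copies of one cross-posted message, as seen by two different lists.
RAW = """Message-ID: <abc@example.org>
Date: Thu, 26 Jul 2007 10:00:00 -0700
List-Post: <mailto:{list}@example.org>
Subject: cross-posted

body
"""
copy_a = message_from_string(RAW.format(list="list-a"))
copy_b = message_from_string(RAW.format(list="list-b"))
```

Without the list header the two copies get the same identifier, because
they share both Message-ID and Date; with it, each list's copy gets its own.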

Jeff
___
Mailman-Developers mailing list
Mailman-Developers@python.org
http://mail.python.org/mailman/listinfo/mailman-developers
Mailman FAQ: http://www.python.org/cgi-bin/faqw-mm.py
Searchable Archives: 
http://www.mail-archive.com/mailman-developers%40python.org/
Unsubscribe: 
http://mail.python.org/mailman/options/mailman-developers/archive%40jab.org

Security Policy: 
http://www.python.org/cgi-bin/faqw-mm.py?req=show&file=faq01.027.htp


Re: [Mailman-Developers] Improving the archives

2007-07-26 Thread Dale Newfield
Jeff Breidenbach wrote:
> So I just looked at 2 million raw messages from 2007, spread over
> a few thousand mailing lists (all data is from mail-archive.com). My
> first question was - when comparing only with messages from the
> same list - how many times do I see a repeated message-id? The
> answer was ... drumroll please ... 260 thousand. What the hell?

I think the question you were originally going to ask got sidetracked. 
If we assume that all these "multiple paths from list to archive" 
duplicates not only share a Message-ID but also a Date (they were the 
same message originally, so they should!), then both schemes (messageid, 
and messageid+date) would decide that all (but one of) these messages 
are redundant.

What we really want to know is how many (non-empty) Message-ID 
collisions are there that *don't* share a Date?  This is the number of 
messages that only-messageid loses, and that the composite identifier 
method would not lose.
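
A rough sketch of that count, assuming the Message-ID and Date headers
have already been extracted per list as (message_id, date) pairs. The
function name count_lost and the input shape are hypothetical, and Date
values are compared as raw strings, so two spellings of the same instant
would count as different dates.

```python
from collections import defaultdict


def count_lost(records):
    """Count messages that message-id-only dedup drops but id+date keeps.

    records: iterable of (message_id, date) header pairs from ONE list.
    """
    dates_by_id = defaultdict(set)
    for message_id, date in records:
        if message_id:  # ignore empty Message-IDs, as in the question
            dates_by_id[message_id].add(date)
    # For each Message-ID, every distinct Date beyond the first is a
    # message the composite identifier would have kept.
    return sum(len(dates) - 1 for dates in dates_by_id.values())


sample = [
    ("<a@example.org>", "Thu, 26 Jul 2007 10:00:00 -0700"),
    ("<a@example.org>", "Thu, 26 Jul 2007 10:00:00 -0700"),  # true duplicate
    ("<a@example.org>", "Fri, 27 Jul 2007 09:30:00 -0700"),  # id reused
    ("<b@example.org>", "Thu, 26 Jul 2007 11:00:00 -0700"),
    ("", "Thu, 26 Jul 2007 12:00:00 -0700"),                 # empty id
]
lost = count_lost(sample)
```

Here the true duplicate is collapsed by both schemes, and only the reused
Message-ID with a different Date contributes to the count.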

-Dale