Re: mailing list archive as mbox

2010-03-07 Thread Alexander Best
Giorgos Keramidas wrote on 2010-03-07:
> On Sun, 07 Mar 2010 12:08:32 +0100 (CET), Alexander Best wrote:
> > Dan Nelson wrote on 2010-03-07:
> >> In the last episode (Mar 07), Alexander Best said:
> >> > hi there,
> >> >
> >> > what are the steps i need to perform to get a copy of the entire
> >> > mailing list archive of let's say freebsd-current@ in mbox format?
> >>
> >> Go to ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/
> >> where you can download weekly gzipped archives of all the mailing
> >> lists since their creation.
> >
> > thanks for the hint, but it would take hours to download all those
> > gzipped files, extract them and merge them.
> >
> > i really need ALL the messages of a mailing list. of course i could
> > use the gzipped files you mentioned if i had some script for
> > downloading, extracting and merging all those files for me.
>
> It's relatively easy to hack one.

wow!!! thanks a billion. that's a great script. i pointed the vars containing
ftp sites at mirrors near me, which give me better download speeds, and will
run the script for freebsd-current@ tonight (~850 archives to pull).

thanks again. great job. :-)

alex


Re: mailing list archive as mbox

2010-03-07 Thread Giorgos Keramidas
On Sun, 07 Mar 2010 12:08:32 +0100 (CET), Alexander Best wrote:
> Dan Nelson wrote on 2010-03-07:
>> In the last episode (Mar 07), Alexander Best said:
>> > hi there,
>> >
>> > what are the steps i need to perform to get a copy of the entire
>> > mailing list archive of let's say freebsd-current@ in mbox format?
>>
>> Go to ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/
>> where you can download weekly gzipped archives of all the mailing
>> lists since their creation.
>
> thanks for the hint, but it would take hours to download all those gzipped
> files, extract them and merge them.
>
> i really need ALL the messages of a mailing list. of course i could use the
> gzipped files you mentioned if i had some script for downloading, extracting
> and merging all those files for me.

It's relatively easy to hack one.

You can get a list of year names from the /archive/ directory itself
with curl(1) and a small amount of Python plumbing around curl:

>>> from subprocess import Popen as popen, PIPE
>>> import re
>>> yre = re.compile(r'^d.*\s(\d+)$')
>>> devnull = open("/dev/null", "w")
>>> def years():
...     curl = "curl -o /dev/stdout ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/"
...     ylist = []
...     for line in popen(curl, shell=True, stdout=PIPE,
...                       stderr=devnull).stdout.readlines():
...         m = yre.match(line)
...         if m:
...             ylist.append(int(m.group(1)))
...     return ylist
...
>>> years()
[1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
 2006, 2007, 2008, 2009, 2010]
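
To see what the year pattern picks up without touching the network, here
is a small offline sketch. It assumes Python 3 and an `ls -l` style
directory listing, which is what FTP servers commonly return; the sample
lines and the parse_years() helper are made up for illustration and the
real server output may use different columns.

```python
import re

# Same pattern as yre above: a line starting with "d" (a directory entry)
# whose last whitespace-separated field consists only of digits.
yre = re.compile(r'^d.*\s(\d+)$')

# Hypothetical listing lines, invented for this example.
SAMPLE_LISTING = """\
drwxr-xr-x   30 ftp  ftp   512 Jan  1  1995 1994
drwxr-xr-x   35 ftp  ftp   512 Jan  1  1996 1995
-rw-r--r--    1 ftp  ftp  1024 Mar  7  2010 README
drwxr-xr-x    2 ftp  ftp   512 Mar  7  2010 misc
"""

def parse_years(listing):
    """Return the numeric directory names found in an ls-style listing."""
    found = []
    for line in listing.splitlines():
        m = yre.match(line)
        if m:
            found.append(int(m.group(1)))
    return found

print(parse_years(SAMPLE_LISTING))  # prints [1994, 1995]
```

The plain file (README) and the non-numeric directory (misc) are both
skipped, so only the year directories survive.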

Then you can grab a list of the freebsd-current archives by looping
through the list of years and looking for the list of files that match
the pattern:


ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/{year}/freebsd-current/(\d+.freebsd-current.gz)

Using a pipe to parse the output of curl you can collect a list of all
the files that match this pattern, e.g.:

>>> def yearfiles(year):
...     base = "ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/%4d/freebsd-current" % year
...     curl = "curl -o /dev/stdout %s/" % base
...     flist = []
...     fre = re.compile(r'^.*\D(\d+.freebsd-current.gz).*$')
...     for line in popen(curl, shell=True, stdout=PIPE,
...                       stderr=devnull).stdout.readlines():
...         m = fre.match(line)
...         if m:
...             flist.append("%s/%s" % (base, m.group(1)))
...     return flist
...
>>> yearfiles(1994)
[]
>>> yearfiles(1995)
['ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz',
 ...]
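
The \D in that pattern matters more than it looks. A leading `.*` is
greedy, so without a non-digit in front of the capture group it swallows
all but the last digit of the file name. A minimal offline sketch (Python
3) of the difference; the listing line and the fre_no_guard name are made
up for this example:

```python
import re

# Pattern from yearfiles() above, and a variant without the \D guard.
fre = re.compile(r'^.*\D(\d+.freebsd-current.gz).*$')
fre_no_guard = re.compile(r'^.*(\d+.freebsd-current.gz).*$')

# Hypothetical listing line, invented for this example.
line = "-rw-r--r--  1 ftp  ftp  123456 Jan  8  1995 19950101.freebsd-current.gz"

with_guard = fre.match(line).group(1)
without_guard = fre_no_guard.match(line).group(1)

print(with_guard)     # prints 19950101.freebsd-current.gz
print(without_guard)  # prints 1.freebsd-current.gz
```

With the guard, the greedy `.*` has to stop at the space before the file
name, so the whole date prefix ends up in the group.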

Concatenating the file lists of all years and fetching each one of them
with curl is then trivial:

>>> ylist = years()
>>> ylist
[1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
 2006, 2007, 2008, 2009, 2010]
>>> flist = []
>>> for y in ylist:
...     f = yearfiles(y)
...     flist = flist + f
...
>>> len(flist)
785

Once you have the list of all the remote gzipped files, you can loop
through the list of files once more and fetch them locally.  I'm only
going to fetch the first two files here, but feel free to fetch all of
them in your version of the script:

>>> flist = flist[:2]
>>> flist
['ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz',
 'ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz']
>>>

>>> import os
>>> from subprocess import call
>>> def getfile(url):
...     out = os.path.basename(url)
...     retcode = call(["curl", "-o", out, url], stderr=devnull)
...     if retcode == 0:
...         print "fetched %s" % url
...     return tuple([url, out, retcode])
...
>>> map(getfile, flist)
fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz
fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz
...
[('ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz',
  '19950101.freebsd-current.gz', 0),
 ('ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz',
  '19950226.freebsd-current.gz', 0)]
>>>
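
The fetched files still have to be uncompressed and joined into one
mailbox. Assuming each weekly archive is a plain mbox file once gunzipped
(worth checking on one file first), appending them in name order should
give a single mbox. A minimal Python 3 sketch; merge_archives() and the
file names in the usage note are made up here, and there is no error
checking:

```python
import glob
import gzip
import shutil

def merge_archives(pattern, outname):
    """Gunzip every file matching pattern and append it to outname.

    Files are processed in sorted name order, which is chronological
    for names like 19950101.freebsd-current.gz.
    Returns the list of files that were merged.
    """
    names = sorted(glob.glob(pattern))
    with open(outname, "wb") as out:
        for name in names:
            with gzip.open(name, "rb") as src:
                # Stream the decompressed bytes straight into the output.
                shutil.copyfileobj(src, out)
    return names
```

Something like merge_archives("*.freebsd-current.gz",
"freebsd-current.mbox") in the download directory would then produce the
merged file. If an archive does not end with a newline, the next message's
"From " line would be glued to the previous one, so a real script may
want to check for that.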

A slightly hackish script that collects all this to a more usable whole
but lacks LOTS of error checking is the following:

#!/usr/bin/env python

from subprocess import call, Popen as popen, PIPE
import os
import re
import sys

devnull = open("/dev/null", "w")
yre = re.compile(r'^d.*\s(\d+)$')
fre = re.compile(r'^.*\D(\d+.freebsd-current.gz).*$')

def years():
    curl = "curl -o /dev/stdout ftp://ftp.freebsd.org/pub/

Re: mailing list archive as mbox

2010-03-07 Thread Alexander Best
Dan Nelson wrote on 2010-03-07:
> In the last episode (Mar 07), Alexander Best said:
> > hi there,
> >
> > what are the steps i need to perform to get a copy of the entire
> > mailing list archive of let's say freebsd-current@ in mbox format?
>
> Go to ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/
> where you can download weekly gzipped archives of all the mailing
> lists since their creation.

thanks for the hint, but it would take hours to download all those gzipped
files, extract them and merge them.

i really need ALL the messages of a mailing list. of course i could use the
gzipped files you mentioned if i had some script for downloading, extracting
and merging all those files for me.

alex
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "freebsd-questions-unsubscr...@freebsd.org"


Re: mailing list archive as mbox

2010-03-06 Thread Dan Nelson
In the last episode (Mar 07), Alexander Best said:
> hi there,
> 
> what are the steps i need to perform to get a copy of the entire mailing list
> archive of let's say freebsd-current@ in mbox format?

Go to ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/ where you
can download weekly gzipped archives of all the mailing lists since their
creation.

-- 
Dan Nelson
dnel...@allantgroup.com


mailing list archive as mbox

2010-03-06 Thread Alexander Best
hi there,

what are the steps i need to perform to get a copy of the entire mailing list
archive of let's say freebsd-current@ in mbox format?

cheers.
alex