On 4/2/06, Fredrik Lundh <[EMAIL PROTECTED]> wrote:
> > > Fredrik, if you would like to help move this all forward, great; I
> > > would appreciate the help. You can write a page scraper to get the
> > > data out of SF
> >
> > challenge accepted ;-)
> >

Woohoo!

> > http://effbot.python-hosting.com/browser/stuff/sandbox/sourceforge/
> >
> > contains three basic tools; getindex to grab index information from a
> > python tracker, getpages to get "raw" xhtml versions of the item pages,
> > and getfiles to get attached files.
> >
> > I'm currently downloading a tracker snapshot that could be useful for
> > testing; it'll take a few more hours before all data are downloaded
> > (provided that SF doesn't ban me, and I don't stumble upon more
> > cases where a certain rhettinger has pasted binary gunk into an
> > iso-8859-1 form ;-).
>
> alright, it took my poor computer nearly eight hours to grab all the
> data, and some tracker items needed special treatment to work around
> some interesting SF bugs, but I've finally managed to download *all*
> items available via the SF tracker index, and *all* data files available
> via the item pages:
>
>     tracker-105470 (bugs)
>         6682 items
>         6682 pages (100%)
>         1912 files
>     tracker-305470 (patches)
>         3610 items
>         3610 pages (100%)
>         4663 files
>     tracker-355470 (feature requests)
>         430 items
>         430 pages (100%)
>         80 files
>
> the complete data set is about 300 megabytes uncompressed, and ~85
> megabytes zipped.
>
> the scripts are designed to make it easy to update the dataset; adding
> new items and files only takes a couple of minutes; refreshing the item
> information may take a few hours.
>
> :::
>
> I've also added a basic "extract" module which parses the XHTML
> pages and the data files. this module can be used by import scripts,
> or be used to convert the dataset into other formats (e.g. a single
> XML file) for further processing.
>
> the source code is available via the above link; I'll post the ZIP file
> somewhere tomorrow (drop me a line if you want the URL).
>

Wonderful, Fredrik! Thank you for doing this!
When the data is available I will arrange to get it put on python.org
somewhere and then start drafting the tracker announcement with where the
data is and how to get at it.

-Brett
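
For anyone reading the archive who wants a feel for the "extract" step Fredrik describes (parse the XHTML item pages, then convert the result into a single XML file), here is a minimal sketch. It is not the code from his sandbox: the function names, the use of the page <title> as the only extracted field, and the tracker/item/title output layout are all assumptions made for illustration. It also assumes each page is well-formed XHTML that xml.etree.ElementTree can parse directly; real SF pages may need cleanup first.

```python
import xml.etree.ElementTree as ET

# Namespace used by well-formed XHTML documents.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def extract_title(page_text):
    """Return the <title> text of one XHTML page, or None if it has none.

    Hypothetical helper: a real extractor would pull out many more fields.
    """
    root = ET.fromstring(page_text)
    title = root.find(f".//{XHTML_NS}title")
    if title is None or title.text is None:
        return None
    return title.text.strip()


def combine(pages):
    """Build one <tracker> element from a mapping of item id -> page text.

    The tracker/item/title layout is an assumed output format.
    """
    tracker = ET.Element("tracker")
    for item_id, page_text in sorted(pages.items()):
        item = ET.SubElement(tracker, "item", id=str(item_id))
        ET.SubElement(item, "title").text = extract_title(page_text) or ""
    return tracker


if __name__ == "__main__":
    # Made-up sample page, not real tracker data.
    sample = {
        1: '<html xmlns="http://www.w3.org/1999/xhtml">'
           "<head><title> Example item </title></head><body/></html>",
    }
    print(ET.tostring(combine(sample), encoding="unicode"))
```

An import script could walk the downloaded page directory, feed each file's text into a mapping like `sample`, and write the combined element out with `ET.ElementTree(...).write(...)`.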